PHP-readability to use DB of content definitions

Asked by Dither

I think php-readability should use a database of main-content definitions (possibly just a plain PHP array) to get page content from known sources, instead of the statistical analysis it does now, because that fails in a lot of cases. It's possible to combine the two approaches: php-readability would first try to find a content definition in the database and, if there is none, fall back to page analysis.

One possible way is to use a microformat consisting of pageElement and pageURL entries, where the first is an XPath definition of the main content and the second is a RegExp defining the pages on which that content appears.
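For example, such a database and lookup might look something like this as a plain PHP array (the example.com entry and the find_content_definition helper are made up for illustration):

<?php
// Hypothetical content-definition database: each entry pairs a pageURL
// RegExp (which pages it applies to) with a pageElement XPath (where the
// main content lives on those pages).
$definitions = array(
    array(
        'pageURL'     => '#^https?://example\.com/articles/.+#',
        'pageElement' => '//div[@id="article-body"]',
    ),
);

// Return the XPath of the first definition whose RegExp matches the URL,
// or null so the caller can fall back to page analysis.
function find_content_definition($url, array $definitions) {
    foreach ($definitions as $def) {
        if (preg_match($def['pageURL'], $url)) {
            return $def['pageElement'];
        }
    }
    return null;
}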

Question information

Language: English
Status: Answered
For: Five Filters
Assignee: No assignee

Dither (dither) said :
#1

You can see an example database here: http://wedata.net/databases/LDRFullFeed/items

Keyvan (keyvan) said :
#2

Dither, thanks for the suggestion, but the point of PHP Readability is to try to determine the content block without explicit pre-defined patterns (having said that, Readability is aware of a few common microformats and does appropriately score recognised microformatted elements higher than other elements).

What you describe, however, is possible to do outside of PHP Readability - i.e. get the page content, find the XPath associated with a URL pattern, use the XPath to match elements, and finally, only if there's no match, invoke PHP Readability.
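A rough sketch of that flow, using PHP's standard DOM and XPath classes (the Readability constructor and method names at the end are assumptions for illustration, not necessarily the exact php-readability API):

<?php
function extract_content($html, $url, $xpathExpr) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);            // suppress warnings from messy real-world markup
    $xp    = new DOMXPath($dom);
    $nodes = $xp->query($xpathExpr);
    if ($nodes !== false && $nodes->length > 0) {
        // The pre-defined XPath matched: return that block directly.
        return $dom->saveHTML($nodes->item(0));
    }
    // No match: fall back to PHP Readability's automatic detection.
    // (Constructor and method names assumed here for illustration.)
    $readability = new Readability($html, $url);
    return $readability->init() ? $readability->getContent() : null;
}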

Dither (dither) said :
#3

Well, as I see it, the main purpose of the Readability project is not to extract content but to make it human readable, so using it in an RSS content extractor is something of a misuse. It works perfectly for cleaning garbage out of pages, but it's not so good at finding what to extract for RSS. I hope you can see my point that guesswork isn't always the best way to collect data (except for statistical analysis or removing excessive data).

Dither (dither) said :
#4

I don't want to be officious, but my point is that Readability alone isn't nearly enough to get plausible results.

To make this clearer, here are some examples. There are several feeds on my list that worked poorly (but somehow worked) with a previous version of the extractor but do not work at all with the new one. And there are other cases: RSS feeds of forum comments without full text, where the user wants just that comment but gets anything from a random block of the first comment on the page to the whole page with all the comments; pages that are ignored when they contain many images; cases where there is a "read more" block and Readability ignores the short text block before it; and a few others.

Dither (dither) said :
#5

Sorry if I misled you earlier, but I meant php-readability as used in the full-text-rss project.

Keyvan (keyvan) said :
#6

Dither: thank you for the feedback - I don't think you're being officious; it's great to hear what everyone thinks about this.

I had thought earlier that you were talking only about PHP Readability, outside the context of the Full-Text RSS project. So, first, to clear things up, it's important to know whether we're talking about the original Readability (by Arc90), this PHP port (PHP Readability) or the Full-Text RSS code (which uses PHP Readability). :)

You say "the main purpose of the Readability project is not to extract content but to make it human readable". That's true of Arc90's Readability project, but not of PHP Readability. For the PHP port the main purpose is simply to extract a single HTML content block (I deliberately left out much of the original Arc90 code related to making it human readable). You can read more about the differences here: http://www.keyvan.net/2010/08/php-readability/

The content extraction code ported from Readability is intended for individual articles with at least a few passages of text. It will perform badly on pages which contain multiple articles, index pages for example, or HTML pages which feature mainly images or video. It does not guarantee accurate results. I've assumed that people who use Readability or PHP Readability are aware of this. Perhaps this needs to be stated more clearly.

Your next point is about the use of PHP Readability in the Full-Text RSS service. You say it's "not so good at finding what to extract for RSS". That, again, depends on what the RSS points to. If you read long articles on blogs or sites which don't offer a full-text feed, then it will likely work very well. If you use it with RSS feeds for picture or video blogs, then it won't.

Having said that, I can understand what you're saying about making it more accurate with patterns and using PHP Readability as fallback - that's actually what I was doing in another project a few months ago. The problem with this approach is that maintaining a list of URLs and patterns is no easy task. And they constantly change. A community effort to maintain such a database would be nice - but I'm not sure how many people would contribute. I assume developers who depend on accurate results will already be using XPaths (and possibly Readability as fallback).

As far as the Full-Text RSS service is concerned, I will be including a few more checks to increase accuracy - e.g. using partial RSS content as a clue to identifying the correct content block, and analysing matched element IDs to see if we have odd results within a feed. The next version of Full-Text RSS will let users specify a CSS selector (e.g. div#content-block). This will work as described earlier: if there's a match, PHP Readability detection will be skipped; if not, PHP Readability will run as normal.
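For instance, a simple id selector like div#content-block maps naturally onto an XPath query; a hypothetical sketch of that conversion (the real Full-Text RSS feature may handle selectors differently):

<?php
// Hypothetical: convert a simple "tag#id" CSS selector to XPath so it can
// be run through DOMXPath before PHP Readability is invoked.
function css_id_selector_to_xpath($selector) {
    if (preg_match('/^([a-z][a-z0-9]*)#([\w-]+)$/i', $selector, $m)) {
        return sprintf('//%s[@id="%s"]', $m[1], $m[2]);
    }
    return null; // unsupported selector
}

// css_id_selector_to_xpath('div#content-block') => '//div[@id="content-block"]'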

Hope that explains my view on all this. :) Suggestions are always good though, feel free to comment if there's anything else...

Keyvan (keyvan) said :
#7

Dither, I just had another look at that URL you posted - I didn't see the total count before, or the about page - it does look like a very impressive list. Perhaps it is the kind of community-maintained list which I said would be hard to manage. If you've got any more info on it, please let me know and I'll look into it some more. I'm particularly interested in how broken XPaths are reported and amended.

Thanks again

Dither (dither) said :
#8

Yes, it's community maintained - changes are applied by anyone who knows how to use RegExp and XPath and has an OpenID. It has the advantage of updatability, as people can change others' outdated entries, wiki-style. And I think it gets a bit of moderator attention sometimes. The database can be retrieved on demand in JSON format.
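For example, the items can presumably be fetched along these lines (the items.json endpoint and the field names are assumptions based on wedata's conventions and should be checked against the database's about page):

<?php
// Assumed endpoint: wedata databases generally expose their items as JSON.
$json  = file_get_contents('http://wedata.net/databases/LDRFullFeed/items.json');
$items = json_decode($json, true);
foreach ((array) $items as $item) {
    // Field names assumed; inspect the actual payload before relying on them.
    $data = isset($item['data']) ? $item['data'] : array();
    // e.g. $data['url'] (page RegExp) and $data['xpath'] (content XPath)
}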

The main purpose of that database is to define page content for this script: http://userscripts.org/scripts/show/22702 It has a similar purpose to your project. There is only a Japanese description, but Google Translate does a pretty neat job of resolving that.
