Fetch Full Web Page Only For Some Sites

Asked by Jonathan A Richmond

Hi,

I've spent literally the last 12 hours trying to figure this out myself with no success. For certain feeds (mainly Engadget), Readability doesn't extract the images and videos along with the full article text. I'm hosting Full-text RSS locally, and I'm trying to figure out a way to skip the Readability routine only for certain explicitly listed sites, and simply output the entire fetched web page instead.

In Engadget's case, I would also like to modify the full-text links that the script follows to point to the i.engadget.com version of the article rather than the original link embedded in the RSS post (all this requires is replacing the "www" with "i").

It seems like these should be pretty simple changes, but I don't have much experience with PHP and for the life of me, I can't figure out how to do it. Would it be possible to guide me through making these changes? Thanks a lot, and thanks for a great service!

Jonathan

Question information

Language: English
Status: Solved
For: Five Filters
Assignee: No assignee
Solved by: Keyvan

Best Keyvan (keyvan) said :
#1

Jonathan: If you're processing feeds, there's a call to grabArticleHtml() on line 361 of makefulltextfeed.php: http://bazaar.launchpad.net/~keyvan/fivefilters/content-only/annotate/head%3A/makefulltextfeed.php#L361 - that's the call to Readability. If you comment out that line, you'll get the entire HTML without Readability attempting to extract content from it.

If this is what you want to do, you can create a condition where Readability only runs if the URL ($permalink) does not contain the string 'engadget'.
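A minimal sketch of that condition, for illustration only. The function name shouldSkipReadability() and the $skipReadabilityFor list are hypothetical and not part of Full-Text RSS; the sketch also checks the URL's host rather than the whole URL, which is a slightly stricter test than a plain "contains 'engadget'" check.

```php
<?php
// Hypothetical list of sites whose pages should be passed through whole.
$skipReadabilityFor = array('engadget');

// Returns true when the host part of $permalink contains one of the
// listed strings (case-insensitive). Hypothetical helper, not project code.
function shouldSkipReadability($permalink, array $sites) {
    $host = parse_url($permalink, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no usable host: keep the default behaviour
    }
    foreach ($sites as $site) {
        if (stripos($host, $site) !== false) {
            return true;
        }
    }
    return false;
}

// In makefulltextfeed.php the existing Readability call would then be wrapped:
// if (!shouldSkipReadability($permalink, $skipReadabilityFor)) {
//     ... existing grabArticleHtml() call stays here, unchanged ...
// }
```

Keeping the site names in one array means more sites can be added later without touching the condition itself.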

To change the URL before the content is fetched, you'll need to move a little further up the file to line 359: http://bazaar.launchpad.net/%7Ekeyvan/fivefilters/content-only/annotate/head%3A/makefulltextfeed.php#L359 Before that line, check if $permalink points to Engadget (as above), and if it does, reassign it: $permalink = str_replace('www.', 'i.', $permalink); Note that str_replace() returns the new string rather than modifying $permalink in place, so the assignment is needed.
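The URL rewrite could look roughly like this. The function name rewriteEngadgetUrl() and the sample URL are made up for the example; the sketch also matches '://www.' instead of a bare 'www.' so that only the host prefix is replaced, which is an assumption about what is wanted here.

```php
<?php
// Hypothetical helper: point Engadget links at the i.engadget.com version.
// URLs that do not mention engadget are returned unchanged.
function rewriteEngadgetUrl($permalink) {
    if (stripos($permalink, 'engadget') === false) {
        return $permalink;
    }
    // str_replace() returns the modified string; it does not change its argument.
    return str_replace('://www.', '://i.', $permalink);
}

// Example with a made-up article URL:
$permalink = 'http://www.engadget.com/2010/08/01/example-story/';
$permalink = rewriteEngadgetUrl($permalink);
echo $permalink, "\n"; // prints http://i.engadget.com/2010/08/01/example-story/
```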

Hope that helps,

Keyvan

Jonathan A Richmond (jonathanrichmond) said :
#2

That's great! Thank you so much. Now here's another question--if you can help me with this I'll be ecstatic: Is there an easy way to call some JavaScript from PHP? Instead of using Readability, I'd like to use the MUCH better Readable App (http://readable-app.appspot.com/), which is written in JavaScript. I understand that you ported Readability over from JavaScript, which makes me think there's no easy way to call JavaScript code from within PHP (otherwise you would have done so), but hopefully I'm wrong about that. Is there any way you can think of to replace the Readability routine with Readable App? Thanks a ton!

Jonathan

Jonathan A Richmond (jonathanrichmond) said :
#3

Thanks Keyvan, that solved my question.

Keyvan (keyvan) said :
#4

Jonathan: Glad that helped. As for your other question: no, PHP and JavaScript are different languages - you cannot simply call one from the other. I haven't used Readable, but have come across it before. It seems to be doing the same thing Readability does. What do you prefer about it? You might be interested in a more recent port of Arc90's Readability if the old version does not work well for you: http://www.keyvan.net/2010/08/php-readability/

Jonathan A Richmond (jonathanrichmond) said :
#5

Okay, thanks. What I prefer about Readable is that it is much more accurate in extracting relevant content, retains media like photos and videos, and formats and presents the resulting content in a much more readable AND true-to-the-original way. Is there any way to implement some of those benefits of Readable into your PHP port of Readability--especially the keeping of photos and videos, as well as parsing of multipage articles?

Keyvan (keyvan) said :
#6

Jonathan, I think the recent version of PHP Readability I linked to above will be an improvement over version 0.4 that was included in this package. It's more accurate and retains more images and videos.

As for presentation - the PHP port does not aim to present content as the original JS versions do. The aim of the port is to enable easy content extraction. I've deliberately left out presentation so developers can decide how they want to present the content. In the case of the Full-Text RSS service, it's used for feed creation - it's up to feed readers to render the contents in a suitable form.

Regarding multipage articles, that was a feature added in Readability 1.7. My PHP port does not support it yet, but I will likely include support for it at some point. :)