A cURLy tale – PHP and feeds and text mining
I have an interest in what might be called “website interchange” (sending data from one site to another) and in text mining of blog material. So, naturally I have to work with RSS, and at the server level read and parse it with PHP. (This also has a bit to do with XML-RPC).
If this makes your eyes glaze over, and you can’t see what possible relevance it has to you and your data-mining needs, consider this. There are about a gazillion websites out there and we are awash with data. Consumers use the web as their first port of call in many cases, and increasingly use aggregators and simplifiers to tailor what they see. So you need to understand and monitor that environment in some QUANTITATIVE manner.
If you have never seen RSS (though I bet you have, when you pressed one of those orange RSS “feed” buttons by accident), I have a typical grungy example below. Look at it and weep. It is not meant to be easily human readable; it is XML. A “feed” is simply an XML file that the server builds from its database, maybe at a scheduled time. Other machines can then request that “feed” (just asking for a file, really) whenever they want.
If you are at all interested in the mechanics of constructing an RSS feed, there is a nice simple tutorial here. http://www.shadow-fox.net/tutorial/Building-an-RSS-Feed-From-a-Database. Or here http://www.webreference.com/authoring/languages/xml/rss/custom_feeds/index.html. Or a quite nice looking “FeedCreator” class here http://www.bitfolge.de/rsscreator-en.html
But you are probably not. Most probably your feed is already built for you if you are using one of the standard blogging packages. But custom feeds can be built, no drama.
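For the curious, here is roughly what "building a custom feed" amounts to. This is only a minimal sketch under my own assumptions: the function name build_rss_feed() and the row keys (title, link, timestamp) are made up for illustration, and the rows are a plain array standing in for whatever your database query returns.

```php
<?php
// Sketch only: build a minimal RSS 2.0 document from an array of rows.
// build_rss_feed() and the row keys are hypothetical; in practice the rows
// would come from a database query.
function build_rss_feed($title, $link, $description, array $rows)
{
    // Escape text so that &, < and > cannot break the XML.
    $esc = function ($s) {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<rss version="2.0">' . "\n<channel>\n";
    $xml .= '<title>' . $esc($title) . "</title>\n";
    $xml .= '<link>' . $esc($link) . "</link>\n";
    $xml .= '<description>' . $esc($description) . "</description>\n";

    foreach ($rows as $row) {
        $xml .= "<item>\n";
        $xml .= '<title>' . $esc($row['title']) . "</title>\n";
        $xml .= '<link>' . $esc($row['link']) . "</link>\n";
        // RSS dates use the RFC 822 style, which PHP provides as DATE_RSS.
        $xml .= '<pubDate>' . gmdate(DATE_RSS, $row['timestamp']) . "</pubDate>\n";
        $xml .= "</item>\n";
    }

    return $xml . "</channel>\n</rss>\n";
}

// Example with a made-up row standing in for a database result.
$rows = array(
    array('title' => 'Cats & dogs', 'link' => 'http://example.com/1', 'timestamp' => 0),
);
echo build_rss_feed('My feed', 'http://example.com/', 'Made-up example feed', $rows);
```

The only real design point is the escaping: a stray ampersand in a title is enough to make the whole feed unparseable at the other end.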
Reading the feed, at the server
This is the situation where you have the feed on one server (remote machine) and another server requests it, maybe to display it or to do some aggregation as a way of structured content building.
It is more complex than downloading the feed to a custom desktop app, the sort of thing you might build in a compiled language (Delphi, of course) if you were researching, say, a set of competitors.
If we are working server-side, we have to play by the server’s rules and this is where we have to talk PHP and cURL. I am assuming PHP because, well, because it is out there on most servers and it does a lot of this sort of stuff very well.
But there is a catch. Security. We are trying to read a file from a remote server, and that may open a hole. And in any event your host might have disabled remote file opening (PHP’s allow_url_fopen setting). DreamHost (a very professional hosting company that I use a lot) says
What.. compile my own version of PHP? I don’t think so. Not before breakfast, anyway.
cURL, hey. cURL is a tool for transferring files with URL syntax, and can be used from the command line. It is quite handy; for example, you could use it for bulk downloads.
Now the problem is that I want to use it in conjunction with XML parsing (RSS is XML, remember) and my starting point was a code snippet from Jim Westergren (http://www.jimwestergren.com/tutorial-feed-your-sites-by-blogging/ which owes a lot to http://www.shadow-fox.net/site/tutorial/37-Building-Content-By-Parsing-RSS-Feeds-With-PHP ), which breaks just about here
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
die("could not open XML input");
}
Now, the wiki at DreamHost http://wiki.dreamhost.com/index.php/Using_XML_Parser_without_allow_url_fopen had this to say
They have, thankfully, allowed libcurl connections. Alas, XML_Parser (which is where the http open was failing) doesn’t fallback to libcurl, so I had to hack it.
Starting at line 698 inside the setInputFile function of Parser.php, this is what I now have:
if (eregi('^(http|ftp)://', substr($file, 0, 10))) {
if (!ini_get('allow_url_fopen')) {
## time to do the curl hack
$ch = curl_init($file);
$fp = tmpfile();
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_exec($ch);
curl_close($ch);
fclose($fp); $file = $tempFile;
WHAT. YET ANOTHER HACK!!??
I am philosophically opposed to hacking into supplied library code (one reason why I like “sealed” precompiled languages like Delphi and C# much much better).
Actually, it is more than just a philosophical objection. It is both a pragmatic one based on many years of experience, and a scientific/statistical one.
These are systems of substantial complexity.. poke at them, change the behaviour of one bit.. and we have destabilized them. Who knows what the consequences might be, consequences for OTHER apps?
So, I am humble and conservative.
Humble in that I don’t believe I can read a few thousand lines of code, pick an intervention point to do a hack, and maintain the confident stance that it is still quite safe.
Conservative in that I am somewhat Bayesian in my approach. I believe, with quite high probability, that the existing system works “as advertised” and can be treated as a black box. I can couple that black box with my own easily understandable bit of functionality (which, because it is short and straightforward, is reasonably assured of being correct). And because they are two independent boxes, I can have a good notion of the combined probability of success: just the product of the two probabilities, both close to 1.
So, no hacking.
What I ended up doing was simple in the extreme. I used cURL to read the feed into a temporary file, and passed that to the parser. Like so.
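$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
// Open the feed directly if allow_url_fopen is enabled;
// otherwise use curl to copy it into a temporary file.
if (ini_get('allow_url_fopen')) {
    if (!($fp = fopen($file, "r"))) {
        die("could not open XML input");
    }
} else {
    // do the curl hack
    $ch = curl_init($file);
    $fp = tmpfile();
    curl_setopt($ch, CURLOPT_FILE, $fp);   // write the response body to $fp
    curl_setopt($ch, CURLOPT_HEADER, 0);   // leave the HTTP headers out
    if (!curl_exec($ch)) {
        die("could not fetch XML input: " . curl_error($ch));
    }
    curl_close($ch);
    rewind($fp);   // back to the start of the temporary file before reading
}
// Feed the parser in 4 KB chunks; feof() tells it when the last chunk arrives.
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
            xml_error_string(xml_get_error_code($xml_parser)),
            xml_get_current_line_number($xml_parser)));
    }
}
fclose($fp);
xml_parser_free($xml_parser);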
Actually, there is even more elegance at http://wiki.dreamhost.com/index.php/CURL .. no hacks, just add on functionality encapsulated in classes.
A dream.
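In the same spirit, the fetch step can be tucked away in one small function of its own, so that neither the parser library nor the calling code needs to know about it. What follows is my own minimal sketch, not the code from the DreamHost wiki: open_feed() is a hypothetical name, and the choice to follow redirects and to treat HTTP errors as failures is an assumption about what you would want.

```php
<?php
// Sketch only: open_feed() is a hypothetical helper, not part of PHP or
// of the DreamHost wiki code. It returns a readable stream, or false.
function open_feed($file)
{
    $is_remote = preg_match('#^(https?|ftp)://#i', $file) === 1;

    // Local files, or remote ones when the host allows it: plain fopen().
    if (!$is_remote || ini_get('allow_url_fopen')) {
        return fopen($file, 'r');
    }

    // Otherwise let cURL copy the remote file into a temporary file.
    $fp = tmpfile();
    $ch = curl_init($file);
    curl_setopt($ch, CURLOPT_FILE, $fp);            // write the body to $fp
    curl_setopt($ch, CURLOPT_HEADER, 0);            // no HTTP headers in the output
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_FAILONERROR, true);    // HTTP 4xx/5xx counts as failure
    $ok = curl_exec($ch);
    curl_close($ch);

    if (!$ok) {
        fclose($fp);
        return false;
    }
    rewind($fp);   // start reading from the beginning of the temporary file
    return $fp;
}
```

With that in place the calling code shrinks to $fp = open_feed($file); followed by the usual fread() and xml_parse() loop, and the library stays a black box.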
Conclusion
Almost certainly more code and detail than you are interested in. And we did not do anything interesting with the feed that we received at the parser; we just did a dumb display on one server of what was fed from another server.
Yet this is the basis for an intelligent organization and display of multiple feeds from multiple sources: perhaps clustered, categorized, filtered, or assigned to a “management review (deferred learning) box”. All data mining starts with managing the data.
Many interesting possibilities, imho.
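To give a flavour of the very simplest kind of "categorized", here is a minimal sketch that uses the same event-style parser to group item titles by their category element. The function name group_titles_by_category() and its handlers are my own invention for illustration; the handlers in the snippet I started from are not shown in this post and may well look different.

```php
<?php
// Sketch only: group_titles_by_category() and its handlers are hypothetical.
// Returns array(category => list of titles), or false if the XML is bad.
function group_titles_by_category($rss)
{
    $groups  = array();
    $in_item = false;   // are we inside an <item>?
    $tag     = '';      // name of the element currently open
    $item    = array();

    $parser = xml_parser_create();
    // Element names arrive upper-cased because case folding is on by default.
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$in_item, &$tag, &$item) {
            $tag = $name;
            if ($name === 'ITEM') {
                $in_item = true;
                $item = array('TITLE' => '', 'CATEGORY' => '');
            }
        },
        function ($p, $name) use (&$in_item, &$tag, &$item, &$groups) {
            $tag = '';
            if ($name === 'ITEM') {
                $category = trim($item['CATEGORY']);
                if ($category === '') {
                    $category = '(none)';
                }
                $groups[$category][] = trim($item['TITLE']);
                $in_item = false;
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$in_item, &$tag, &$item) {
            // Text may arrive in several chunks, so append rather than assign.
            if ($in_item && isset($item[$tag])) {
                $item[$tag] .= $data;
            }
        }
    );

    $ok = xml_parse($parser, $rss, true);
    return $ok ? $groups : false;
}
```

Fed a feed like the Furl one, this should hand back something along the lines of a "video" bucket, a "sql server" bucket and a "general" bucket, each holding its titles. Crude, but counts per category over time are already a quantitative view of the environment.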
A typically grungy bit of RSS, for your delectation, in case you have never seen such a cutie before
<rss version="2.0">
<channel>
<title>Furl - Latest Entries</title>
<link>http://www.furl.net/furled.jsp</link>
<description>Furl archive.</description>
<docs>http://backend.userland.com/rss</docs>
<generator>Furl (http://www.furl.net)</generator>
<item>
<title>BeadingProject</title>
<link>http://www.furl.net/forward.jsp?id=16782152</link>
<description/>
<category>video</category>
<dc:creator>drgaal</dc:creator>
<guid isPermaLink="true">http://www.furl.net/item.jsp?id=16782152</guid>
<pubDate>Tue, 20 Feb 2007 22:27:54 GMT</pubDate>
<furl:rating>3</furl:rating>
<furl:clipping/>
</item>
<item>
<title>Dev Articles</title>
<link>http://www.furl.net/forward.jsp?id=16782151</link>
<description/>
<category>sql server</category>
<dc:creator>anilcolin</dc:creator>
<guid isPermaLink="true">http://www.furl.net/item.jsp?id=16782151</guid>
<pubDate>Tue, 20 Feb 2007 22:27:55 GMT</pubDate>
<furl:rating>3</furl:rating>
<furl:clipping/>
</item>
<item>
<title>Tower reviews City beautification programs - Sulphur Southwest Daily News</title>
<link>http://www.furl.net/forward.jsp?id=16782150</link>
<description/>
<category>general</category>
<dc:creator>sheldon47brown</dc:creator>
<guid isPermaLink="true">http://www.furl.net/item.jsp?id=16782150</guid>
<pubDate>Tue, 20 Feb 2007 22:27:53 GMT</pubDate>
<furl:rating>3</furl:rating>
<furl:clipping/>
</item>