Dastar...
Your Source For Information
46443 Articles Listed





Search:



Screen scraping your way into RSS

Introduction
RSS is one the hottest technologies at the moment, and even big web publishers (such as the New York Times) are getting into RSS as well. However, there are still a lot of websites that do not have RSS feeds.

If you still want to be able to check those websites in your favourite aggregator, you need to create your own RSS feed for those websites. This can be done automatically with PHP, using a method called screen scrapping. Screen scrapping is usually frowned upon, as it's mostly used to steal content from other websites.

I personally believe that in this case, to automatically generate a RSS feed, screen scrapping is not a bad thing. Now, on to the code!

Getting the content
For this article, we'll use PHPit as an example, despite the fact that PHPit already has RSS feeds.

We'll want to generate a RSS feed from the content listed on the frontpage. The first step in screen scraping is getting the complete page. In PHP this can be done very easily, by using implode(file("", "[the url here]")); IF your web host allows it. If you can't use file() you'll have to use a different method of getting the page, e.g. using the CURL library.

Now that we have the content available, we can parse it for the content using some regular expressions. The key to screen scraping is looking for patterns that match the content, e.g. are all the content items wrapped in <div>'s or something else? If you can successfully discover a pattern, then you can use preg_match_all() to get all the content items.

For PHPit, the pattern that match the content is <div class="contentitem">[Content Here]<div>. You can verify this yourself by going to the main page of PHPit, and viewing the source.

Now that we have a match we can get all the content items. The next step is to retrieve the individual information, i.e. url, title, author, text. This can be done by using some more regular expression and str_replace() on the each content items.

By now we have the following code;

<?php

// Get page
$url = "http://www.phpit.net/";
$data = implode("", file($url));

// Get content items
preg_match_all ("/<div class="contentitem">([^`]*?)</div>/", $data, $matches);
Like I said, the next step is to retrieve the individual information, but first let's make a beginning on our feed, by setting the appropriate header (text/xml) and printing the channel information, etc.

// Begin feed
header ("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version="1.0" encoding="ISO-8859-1" ?>
";
?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>PHPit Latest Content</title>
<description>The latest content from PHPit (http://www.phpit.net), screen scraped!</description>
<link>http://www.phpit.net</link>
<language>en-us</language>

<?
Now it's time to loop through the items, and print their RSS XML. We first loop through each item, and get all the information we get, by using more regular expressions and preg_match(). After that the RSS for the item is printed.
<?php
// Loop through each content item
foreach ($matches[0] as $match) {
// First, get title
preg_match ("/">([^`]*?)</a></h3>/", $match, $temp);
$title = $temp['1'];
$title = strip_tags($title);
$title = trim($title);

// Second, get url
preg_match ("/<a href="([^`]*?)">/", $match, $temp);
$url = $temp['1'];
$url = trim($url);

// Third, get text
preg_match ("/<p>([^`]*?)<span class="byline">/", $match, $temp);
$text = $temp['1'];
$text = trim($text);

// Fourth, and finally, get author
preg_match ("/<span class="byline">By ([^`]*?)</span>/", $match, $temp);
$author = $temp['1'];
$author = trim($author);

// Echo RSS XML
echo "<item>
";
echo " <title>" . strip_tags($title) . "</title>
";
echo " <link>http://www.phpit.net" . strip_tags($url) . "</link>
";
echo " <description>" . strip_tags($text) . "</description>
";
echo " <content:encoded><![CDATA[
";
echo $text . "
";
echo " ]]></content:encoded>
";
echo " <dc:creator>" . strip_tags($author) . "</dc:creator>
";
echo " </item>
";
}
?>
And finally, the RSS file is closed off.
</channel>
</rss>
That's all. If you put all the code together, like in the demo script, then you'll have a perfect RSS feed.

Conclusion
In this tutorial I have shown you how to create a RSS feed from a website that does not have a RSS feed themselves yet. Though the regular expression is different for each website, the principle is exactly the same.

One thing I should mention is that you shouldn't immediately screen scrape a website's content. E-mail them first about a RSS feed. Who knows, they might set one up themselves, and that would be even better.

Download sample script

About the Author

Dennis Pallett is a young tech writer, with much experience in ASP, PHP and other web technologies. He enjoys writing, and has written several articles and tutorials. To find more of his work, look at his websites at http://www.phpit.net, http://www.aspit.net and http://www.ezfaqs.com
Item Bids Price Ends
NEW COMPAQ PRESARIO V3018US 14.1
NEW COMPAQ PRESARIO V3018US 14.1" GLOSSY LCD SCREEN
0
$69.99
3d 20h
NEW B154EW04 V.B 15.4
NEW B154EW04 V.B 15.4" WXGA GLOSSY LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW ACER ASPIRE 2000 15.4
NEW ACER ASPIRE 2000 15.4" WXGA LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW NOKIA 5310 8600 6300 LCD DISPLAY SCREEN REPLACEMENT
NEW NOKIA 5310 8600 6300 LCD DISPLAY SCREEN REPLACEMENT
0
$29.99
3d 20h
NEW LTN140W1 14.0
NEW LTN140W1 14.0" WXGA LAPTOP LCD SCREEN
0
$109.00
3d 20h
NEW ACER TRAVELMATE 5050 14.1
NEW ACER TRAVELMATE 5050 14.1" WXGA GLOSSY LCD SCREEN
0
$69.99
3d 20h
PROTECH *Screen Protector* For the BlackBerry Bold 9000
PROTECH *Screen Protector* For the BlackBerry Bold 9000
0
$9.99
3d 20h
NEW COMPAQ PRESARIO V3018TU 14.1
NEW COMPAQ PRESARIO V3018TU 14.1" WXGA LCD SCREEN
0
$69.99
3d 20h
NEW B154EW04 V.2 15.4
NEW B154EW04 V.2 15.4" WXGA GLOSSY LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW ACER ASPIRE 1690 15.4
NEW ACER ASPIRE 1690 15.4" WXGA LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW LTN140W1 14.0
NEW LTN140W1 14.0" WXGA GLOSSY LAPTOP LCD SCREEN
0
$109.00
3d 20h
NEW ACER TRAVELMATE 4720 14.1
NEW ACER TRAVELMATE 4720 14.1" WXGA LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW COMPAQ PRESARIO V3015CA 14.1
NEW COMPAQ PRESARIO V3015CA 14.1" GLOSSY LCD SCREEN
0
$69.99
3d 20h
NEW B154EW02 V.6 15.4
NEW B154EW02 V.6 15.4" WXGA LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW ACER ASPIRE 1690 15.4
NEW ACER ASPIRE 1690 15.4" WXGA GLOSSY LCD SCREEN
0
$69.99
3d 20h
NEW LP140WX1 14.0
NEW LP140WX1 14.0" WXGA LAPTOP LCD SCREEN
0
$109.00
3d 20h
NEW ACER TRAVELMATE 3280 14.1
NEW ACER TRAVELMATE 3280 14.1" WXGA LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW COMPAQ PRESARIO V3010CA 14.1
NEW COMPAQ PRESARIO V3010CA 14.1" WXGA LCD SCREEN
0
$69.99
3d 20h
NEW B154EW02 V.5 15.4
NEW B154EW02 V.5 15.4" WXGA GLOSSY LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW ACER ASPIRE 1680 15.4
NEW ACER ASPIRE 1680 15.4" WXGA LAPTOP LCD SCREEN
0
$69.99
3d 20h
NEW LP140WX1 14.0
NEW LP140WX1 14.0" WXGA GLOSSY LAPTOP LCD SCREEN
0
$109.00
3d 20h
2.2 Inch Screen Guard for Camera MP3 Phone LCD Screen
2.2 Inch Screen Guard for Camera MP3 Phone LCD Screen
0
$2.11
3d 20h
2 x Screen Protector Film for iPod Touch 2nd Gen New
2 x Screen Protector Film for iPod Touch 2nd Gen New
0
$1.00
3d 20h
NEW ACER TRAVELMATE 3270 14.1
NEW ACER TRAVELMATE 3270 14.1" WXGA GLOSSY LCD SCREEN
0
$69.99
3d 20h
NEW COMPAQ PRESARIO V3010CA 14.1
NEW COMPAQ PRESARIO V3010CA 14.1" GLOSSY LCD SCREEN
0
$69.99
3d 20h
  View all items on eBayDisclaimer  
These results were generated from eBay on 01-6-2009 9:58 am.
eBay Marketplace
Screen scraping your way into RSS