Aggregation via RSS files is cool, but sometimes you just need some good old-fashioned screen scraping to get the data you want out of a particular web site. This usually means hauling out your favorite computer language and hurting yourself with regular expressions until your brain bleeds black ink.
Enter an interesting little open source tool, appropriately named “screen-scraper,” available at screen-scraper.com.
screen-scraper is a tool used to extract data from web sites, and consists of two main parts. The first is a proxy server that can be run locally in order to view in plain text the contents of web pages. The second is an engine that can be easily configured (by geeks) to extract information from web sites, handling such tasks as authentication to a site, following redirects, and automatic handling of cookies.
screen-scraper can also be run as a server, which allows it to be invoked from external applications written in a variety of languages. The currently supported languages are Java, PHP, and any COM-friendly language such as Visual Basic or ASP.
Huh? What? Yo, Dean, I thought you were going to do a site review?! Yeah, me too. But work-work is busy … and so is a redesign for a friend’s site. More on that later … uh, later.
Here is the bottom line. You can use screen-scraper to pick off information from pages that interest you. Moreover, you can use patterns instead of hard-coded data. Meaning, you can identify a tag or some other element whose text changes daily, but whose format is generally the same. For example, <div class="datetitle">16-Jul-2003</div> can be found not by looking for 16-Jul-2003, but by looking for two digits, a dash, three letters, another dash, and four more digits, all inside a div with the class datetitle.
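If you would rather see the pattern idea in code, here is a minimal sketch in Python using a plain regular expression. This is not how screen-scraper itself is configured; the names DATE_TITLE and find_date_titles are made up for illustration, and the pattern assumes the div carries only the class attribute.

```python
import re

# Hypothetical pattern: two digits, a dash, three letters, a dash,
# four digits, all inside <div class="datetitle"> ... </div>.
DATE_TITLE = re.compile(
    r'<div\s+class="datetitle"\s*>\s*(\d{2}-[A-Za-z]{3}-\d{4})\s*</div>',
    re.IGNORECASE,
)


def find_date_titles(html):
    """Return every date-shaped string found in a datetitle div."""
    return DATE_TITLE.findall(html)


if __name__ == "__main__":
    page = '<p>News</p><div class="datetitle">16-Jul-2003</div>'
    print(find_date_titles(page))
```

The point is that tomorrow's page, with a different date in the same spot, still matches without touching the pattern.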
Then if you want, you can take the parsed information and shove it into an RSS file for easy aggregation.
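For the RSS step, a rough sketch using Python's standard library might look like the following. The function name build_rss, the example.com URLs, and the generated description text are all assumptions for illustration; a real feed would probably want per-item descriptions and dates as well.

```python
import xml.etree.ElementTree as ET


def build_rss(channel_title, channel_link, items):
    """Build a bare-bones RSS 2.0 document from (title, link) pairs.

    Hypothetical helper: only the required channel elements and the
    title and link of each item are written.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Scraped items from " + channel_title
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    scraped = [("16-Jul-2003", "http://example.com/news/1")]
    print(build_rss("Example Site", "http://example.com/", scraped))
```

Write the returned string to a file on your web server, point your aggregator at it, and you have a feed for a site that never offered one.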
Those of you who are married should have your wives complain to my better half directly for bringing this cool tool to your attention just before the weekend. Let me know if you come up with anything fun and interesting.