Heal Your Church WebSite


Teaching, rebuking, correcting & training in righteous web design.

Screen-Scraper, the Application

Aggregation via RSS files is cool, but sometimes you just need some good old-fashioned screen scraping to get the data you want out of a particular web site. This usually means hauling out your favorite computer language and hurting yourself with regular expressions until your brain bleeds black ink.

Enter an interesting little open source tool, appropriately named “screen-scraper,” available at screen-scraper.com.

screen-scraper is a tool used to extract data from web sites, and consists of two main parts. The first is a proxy server that can be run locally in order to view in plain text the contents of web pages. The second is an engine that can be easily configured (by geeks) to extract information from web sites, handling such tasks as authentication to a site, following redirects, and automatic handling of cookies.

screen-scraper can also be run as a server, which allows it to be invoked from external applications written in a variety of languages. The currently supported languages are Java, PHP, and any COM-friendly language such as Visual Basic or ASP.

Huh? What? Yo, Dean, I thought you were going to do a site review?! Yeah, me too. But work-work is busy … so is a redesign for a friend’s site. More on the latter … uh, later.

Here is the bottom line. You can use screen-scraper to pick off information from pages that interest you. Moreover, you can use patterns instead of hard-coded data. Meaning, you can identify a tag or some other element whose text changes daily, but whose format is generally the same. For example, <div class="datetitle">16-Jul-2003</div> can be found not by looking for 16-Jul-2003, but by looking for two digits, a dash, three letters, another dash, and four digits, all inside a div with the class datetitle.
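For the curious, here is roughly what that kind of pattern looks like if you do it by hand. This is a minimal sketch in plain Python with a regular expression, not screen-scraper's own pattern syntax; the function name find_date_titles and the sample HTML are made up for illustration.

```python
import re

# Two digits, a dash, three letters, a dash, four digits,
# all inside <div class="datetitle"> ... </div>.
DATE_TITLE = re.compile(
    r'<div\s+class="datetitle">\s*(\d{2}-[A-Za-z]{3}-\d{4})\s*</div>'
)


def find_date_titles(html):
    """Return every date string found inside a datetitle div."""
    return DATE_TITLE.findall(html)


# Hypothetical page fragment, invented for this example.
sample = (
    '<div class="datetitle">16-Jul-2003</div><p>Some text</p>'
    '<div class="datetitle">17-Jul-2003</div>'
)

if __name__ == "__main__":
    print(find_date_titles(sample))
```

Because the pattern describes the shape of the date rather than the date itself, the same code keeps working when tomorrow's page shows a different day.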

Then if you want, you can take the parsed information and shove it into an RSS file for easy aggregation.
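If you want to see what "shove it into an RSS file" might look like, here is a rough sketch using Python's standard xml.etree.ElementTree module. The build_rss function, the feed details, and the example.com links are all hypothetical; the output is a bare-bones RSS 2.0 document, not anything screen-scraper generates for you.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, items):
    """Build a minimal RSS 2.0 document from a list of scraped items.

    Each item is a dict with "title" and "link" keys (an assumed shape,
    chosen for this example).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]

    # Return the feed as a string, ready to be written to a file.
    return ET.tostring(rss, encoding="unicode")


# Hypothetical scraped data, invented for this example.
scraped = [
    {"title": "16-Jul-2003", "link": "http://example.com/2003/07/16"},
    {"title": "17-Jul-2003", "link": "http://example.com/2003/07/17"},
]

if __name__ == "__main__":
    print(build_rss("Scraped Feed", "http://example.com/",
                    "Items picked off a page", scraped))
```

Letting ElementTree build the XML means special characters such as & in a scraped title get escaped for you, which is easy to get wrong when gluing strings together.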

Those of you who are married should have your wives complain to my better-half directly for bringing this cool tool to your attention just before the weekend. Let me know if you come up with anything fun and interesting.

3 Comments

  1. Just wanted to mention – and you might want to mention it next time you bring up screen scraping – you have to have permission to re-print material that you scrape off the web. In my understanding, you can’t legally just scrape someone’s site and print parts of it on your own site without giving proper credit. And in certain circumstances (commercial use) you often must have permission from the author to use even a little of their work – even if you give credit. But it’s usually not hard to get permission – especially if you offer to link a copyright notice to the author’s site. Just remember to ask before taking.

  2. …might also want to mention that this app costs over USD$300… not exactly in keeping with the “free scripts and stuff” feel of this blog, etc.

    - Alister

  3. There is a free version and a pro version. The pro version costs $300+US.