Mark Pilgrim – How Aggregating Indeed

When I posted my blog entry “How Aggregating – Google to Launch News Search Site”, I followed up by emailing Mark Pilgrim to see if he was going to offer a cool Python interface that might lob news from the Google API over to the Blogger API. His reply caught me off guard because:

  1. He pointed out an ambiguity in my original post, where I implied the Google API could be accessed over XML-RPC.
  2. He also brought to my attention an ethical issue I overlooked because I was viewing things from a purely technical point of view.

Hence, I’ve categorized this message under “take a plank out of my eye”, and I am posting his messages because they are informative, instructive and accurate:

> Perhaps some Python for the following idea?

Not a chance. Google goes to great lengths to block all scrapers and other scripts that try to automatedly pull content from anywhere on their site. Their SOAP API only covers the main search results (no image search, no directory search, no groups search, no news search). In other words, unless they provide an interface for it, it’ll be next to impossible to grab the raw data and repurpose it.

Here is Mark’s reply when I asked permission to reprint the above email:

Please do. It’s an important point in general, that web services are up to the producer, not the consumer. I know there’s lots of unauthorized scraping and such going on in the world, but Google fights that pretty hard. We can all evangelize the idea of providing RSS feeds, but if they don’t want to do it, we can’t — and shouldn’t — route around them.

Thanks MARK! I get so keyed up with new toys and ideas that I sometimes forget it’s only fun until someone gets hurt!

