Heal Your Church WebSite


Teaching, rebuking, correcting & training in righteous web design.

RSS::FeedFinder -> grace-driven or totally promiscuous?

When I started blogs4God back on July 29, 2002, I had but a couple hundred links to deal with. Now we’re getting close to 700 links, and it’s getting hard to write up caches on who said what. So I’ve finally started in earnest on my blogs4God aggregation project.

The first step is to see who’s updated their pages most recently. Only one problem: asking for a last-modified date with LWP doesn’t help me out with server-assisted and/or dynamically generated pages, such as those that end in .shtml or .php, since those usually come back without a meaningful modification date.

Fortunately, those who are slick enough to use server-side includes or server-side scripting languages are also usually slick enough to provide some sort of RSS, RDF or XML syndication file — which is static and offers a very good substitute for finding out when a page was modified.

So what I needed to do was write a program that would use LWP to get the last-modified date of the page. If no date came back, it would then call a routine to seek out and find an associated syndication file. So I tried Aaron Straup Cope’s RSSAutodiscovery module, only to find that people such as yours truly tend to make all sorts of omissions and mistakes in offering a link.
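For those who think better in code, here is a rough sketch of that two-step check. It is written in Python rather than Perl and is not the blogs4God code itself. The names `last_updated`, `http_head`, and the `find_feed` callback are all made up for illustration, and the fetching is passed in as a function so the logic can be tried out without touching the network.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def http_head(url):
    """Issue a HEAD request and return the response headers, names lower-cased.

    Hypothetical helper; the User-Agent string is a placeholder.
    """
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "feed-sketch/0.1"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {name.lower(): value for name, value in resp.headers.items()}


def _modified(headers):
    """Parse a Last-Modified header into a datetime, or return None."""
    raw = headers.get("last-modified")
    if not raw:
        return None
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Header was present but not a usable HTTP date.
        return None


def last_updated(page_url, head, find_feed):
    """Return (url_that_gave_the_date, datetime), or None if nothing worked.

    head:      function url -> headers dict (e.g. http_head above)
    find_feed: function page_url -> feed URL or None (assumed to exist)
    """
    stamp = _modified(head(page_url))
    if stamp is not None:
        return page_url, stamp

    # The page gave no date, so fall back to its syndication file.
    feed_url = find_feed(page_url)
    if feed_url is None:
        return None
    stamp = _modified(head(feed_url))
    return (feed_url, stamp) if stamp is not None else None
```

In real use one would call something like `last_updated(url, http_head, my_feed_finder)`, where `my_feed_finder` is whatever discovery routine is on hand.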

So I wrote a Perl module named RSS::FeedFinder, which looks for anything closely resembling a link or a reference to an RSS, RDF or XML feed and adds it to a weighted list based upon search criteria. One can then retrieve the entire list, or just the entry for a particular method/weight.
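To make the weighted-list idea concrete, here is a small sketch in Python. It is not RSS::FeedFinder, and the three methods, their weights, and the names `FeedCandidates` and `find_feeds` are my own assumptions chosen for illustration: a proper autodiscovery `<link>` scores highest, a sloppy `<link>` that merely points at something feed-shaped comes next, and an ordinary anchor comes last.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed heuristics: media types and file extensions that suggest a feed.
FEED_TYPES = {"application/rss+xml", "application/rdf+xml",
              "application/xml", "text/xml"}
FEED_EXTENSIONS = (".rss", ".rdf", ".xml")


class FeedCandidates(HTMLParser):
    """Collect feed-like URLs, remembering the best (weight, method) for each."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = {}  # absolute url -> (weight, method)

    def handle_starttag(self, tag, attrs):
        a = {name: (value or "") for name, value in attrs}
        href = a.get("href", "")
        if not href:
            return
        feedish = urlparse(href).path.lower().endswith(FEED_EXTENSIONS)
        if tag == "link":
            rel = a.get("rel", "").lower().split()
            if "alternate" in rel and a.get("type", "").lower() in FEED_TYPES:
                self._add(href, 3, "autodiscovery")
            elif feedish:
                self._add(href, 2, "link-tag")
        elif tag == "a" and feedish:
            self._add(href, 1, "anchor")

    def _add(self, href, weight, method):
        url = urljoin(self.base_url, href)
        if weight > self.found.get(url, (0, ""))[0]:
            self.found[url] = (weight, method)


def find_feeds(html, base_url, method=None):
    """Return [(url, weight, method)] sorted best first.

    Pass method="anchor" (for example) to get only that method's entries.
    """
    parser = FeedCandidates(base_url)
    parser.feed(html)
    ranked = sorted(
        ((url, w, m) for url, (w, m) in parser.found.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return [item for item in ranked if method in (None, item[2])]
```

Calling `find_feeds(page_source, page_url)` gives the whole ranked list, and the first entry is the best guess. A fuller version would also fetch each candidate and confirm it really is a feed.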

Yeah, that was total geek-speak. Okay, it employs some really lightweight heuristics to give you the best shot at finding a syndication file. It also needs a lot of work. The code is ugly, and there are about a dozen things I could do better. But it works for me for now, so here is your chance to improve on it:


See what I mean? Needs work. That said, if you’re interested in how the pros do it, then I’d suggest a quick visit on over to Mark Pilgrim’s RSS Parser Project. I know I am, because that’s the next step.

2 Comments

  1. Actually, the cite is now http://diveintomark.org/archives/2002/08/15/ultraliberal_rss_locator.html

    And the project you probably want to look at is http://diveintomark.org/projects/rss_finder/ , not rss_parser.

    Two small things:
    1. Set your User-Agent to something specific and obey robots.txt as you’re following links. I believe there’s a RobotUA module for this purpose. http://search.cpan.org/dist/libwww-perl/lib/LWP/RobotUA.pm

    2. Content-type can be application/xml, application/rdf+xml, application/rss+xml. Not just text/… (Hell, I’ve seen RSS feeds served as text/html, but whatever. Some people are just beyond saving.)

    Other than that, very cool. I’m all for smart software that works hard for me.

  2. Mark,

    Excellent suggestions! I won’t have time to get to them till early next week, but you’re so right about robots.txt.

    And thanks for the heads-up on application/xml. I get so bleary eyed at 3 am !-)