Heal Your Church WebSite


Teaching, rebuking, correcting & training in righteous web design.

Using Cron with LWP::Simple and XML::RSS to retrieve news feeds

Originally published on March 24, 2003, when the war in Iraq was heating up and I found that direct links to popular RSS news feeds were affecting the speed at which pages loaded on a friend’s blog that I help maintain. I’m re-posting this article for reasons that will become obvious later this week. Until then, enjoy this “Spidering Hack!-)”

Syndicated news feeds are a nice way of adding some compelling content to your site.

The problem is that on heavy news days the feed sometimes gets overrun, goes offline, and/or suffers a host of other connectivity issues that make YOUR site load slowly, because the software holds your user hostage while the feed-retrieval portion of the application waits to time out. You see this a lot with PHPNuke and PostNuke sites.

A simple way around this problem is to use a program that periodically retrieves the feed, slices-n-dices it, and effectively caches it into an easy-to-include file on your host. Doing this achieves five goals:

  1. user page loads are not penalized when feeds go down
  2. failures to connect do not harm the existing include file
  3. multiple attempts to read the feed do not penalize the user
  4. feed can be mirrored for local/private use
  5. content can be formatted to taste

Below is a little program I wrote Thursday to grab news feeds from an AP Wire I found via Scripting.com, for inclusion on the website of a friend who makes his living in the political arena.

Using the following CRONTAB syntax, the program is executed once an hour, at 30 minutes past:
30 * * * * /home/YOURPATH/getap.pl >/dev/null
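One nuance worth spelling out: in crontab syntax, `30 * * * *` fires once per hour, at minute 30. If a genuine every-30-minutes schedule is preferred, the minute field takes a step value (the path below is the article’s placeholder):

```
# once an hour, at half past
30 * * * * /home/YOURPATH/getap.pl >/dev/null

# every 30 minutes, on the hour and half past
*/30 * * * * /home/YOURPATH/getap.pl >/dev/null
```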

The nice thing about this approach is that this particular feed does “get busy” from time to time, and at one point on Friday it went offline. My users did not notice because, in most cases, I was able to get by the “busy signal” on the 2nd or 3rd attempt out of 10. In the case where the feed site went offline, my users merely viewed an older include file without interruption or delay.

Anyway, since I haven’t posted anything worthwhile in the past few days, I figured this was a good penance:

#!/usr/bin/perl -w
# -----------------------------------------------------------------------
# copyright Dean Peters © 2003 - all rights reserved
# http://www.HealYourChurchWebSite.com
# -----------------------------------------------------------------------
#
# getap.pl is free software. You can redistribute and modify it
# freely without any consent of the developer, Dean Peters, if and
# only if the following conditions are met:
#
# (a) The copyright info and links in the headers remain intact.
# (b) The purpose of distribution or modification is non-commercial.
#
# Commercial distribution of this product without written
# permission from Dean Peters is strictly prohibited.
# This script is provided on an as-is basis, without any warranty.
# The author does not take any responsibility for any damage or
# loss of data that may occur from use of this script.
#
# You may refer to our general terms & conditions for clarification:
# http://www.healyourchurchwebsite.com/archives/000002.shtml
#
# For more info about this code, please refer to the following article:
# http://www.healyourchurchwebsite.com/archives/000760.shtml
#
# combine this code with crontab for best results, e.g.:
# 30 * * * * /home/YOURPATH/getap.pl >/dev/null
#
# -----------------------------------------------------------------------
use XML::RSS;
use LWP::Simple;

# get content from feed -- using 10 attempts
my $content = getFeed("http://www.goupstate.com/apps/pbcs.dll/section?Category=RSS04&mime=xml", 10);

# save off feed to a file -- make sure you have write access to file or directory
saveFeed($content, "newsfeed.xml");

# create customized output
my $output = createOutput($content, 8);

# save it
saveFeed($output, "newsfeed.inc.php");

sub getFeed {
     my ($url, $attempts) = @_;
     my $lc = 0;		# loop count
     my $content;
     # keep trying until the feed answers or we run out of attempts;
     # LWP::Simple's get() returns undef on any failure
     while ($lc < $attempts) {
         $content = get($url);
         last if defined($content);
         $lc++;
         sleep(5);		# give a busy server a moment before retrying
     }
     die("Could not retrieve $url after $attempts attempts") unless defined($content);
     return $content;
}

sub saveFeed {
     my ($content, $outfile) = @_;
     open(OUT, "> $outfile") || die("Cannot Open File $outfile");
         print OUT $content;
     close(OUT);
}

sub createOutput {
     my ($content, $feedcount) = @_;

     # create new instance of XML::RSS
     my $rss = new XML::RSS;

     # parse the RSS content into an output string to be saved at end of parsing
     $rss->parse($content);
     my $title = $rss->{'channel'}->{'title'};	# channel title, if you prefer it to the hard-coded header
     my $output = "GoUpstate/AP NewsWire\n";
     my $i = 0;
     foreach my $item (@{$rss->{'items'}}) {
         next unless defined($item->{'title'}) && defined($item->{'link'});
         $i += 1;
         next if $i > $feedcount;
         $output .= "<a href=\"$item->{'link'}\">$item->{'title'}</a><br />\n";
     }

    # if a copyright & link exists then post it
    my $copyright = $rss->{'channel'}->{'copyright'};
    my $link = $rss->{'channel'}->{'link'};
    my $description = $rss->{'channel'}->{'description'};
    $output .= "  <a href=\"$link\" title=\"$description\">$copyright</a>\n"	if ($copyright && $link);
    return $output;
}

Of course, now I need to go ahead and practice what I preach and do the same here!
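Speaking of practicing what you preach: the cached file only pays off once a page actually pulls it in. A minimal sketch, assuming the page is PHP and sits in the same directory as the script’s output (the filename comes from the script above):

```
<?php include("newsfeed.inc.php"); ?>
```

Sites built on server-side includes could do the equivalent with an SSI `#include` directive in an .shtml page instead.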

5 Comments

  1. This is off-topic, but would it be possible to write a TrackBack program for Blogspot? Has anybody done that already?

  2. There is a standalone trackback tool (http://www.movabletype.org/docs/tb-standalone.html) for MovableType; however, this requires access to a server with CGI access, something that most people don’t have. Maybe you could find a nice person, though, who would host the file for you?
    Actually if you read the very last part of the documentation they write…
    “(Possible use) 3. Centralized tool
    This TrackBack tool requires that the end user have the ability to run CGI scripts on their server. For many users (eg BlogSpot users), this is not an option. For such users, a centralized system (based on this tool, perhaps) would be ideal. ”

    Hopefully someone will run with it, like those people who run websites that do remote comments.

  3. Interesting, it’d be great if somebody would start a service like that. I’d do it, but I’m not really that techno-literate.

  4. I’ve taken your newsfeed code above and added some modifications as follows:

    I linked to a Moreover RSS feed. There are many available and you can see all feeds available and implement the one you want at http://w.moreover.com/categories/category_list_rss.html

    Added a line to delete the old include file before creating a new one. My server did not want to overwrite the old include (don’t really understand why because the directory is CHMOD 777).

    Added a line to include the content description from the feed.

    Increased the number of feed items output to 30. Most RSS feeds contain approx. 15 items.

    Added target=”_blank” to all links. Opens the newsfeed items in a separate window.

    Revised Code is available at http://www.bvmc.org/pub/getap.txt

    Remember to change the first line
    #!/usr/local/bin/perl -w
    to meet your specific needs. Mine needs a ‘local’ statement. Many do not.

    I load the “getap.pl” file to my cgi directory and CHMOD to 755.

    I create a “news” subdirectory under my cgi directory and CHMOD 777

    I place an empty “newsfeed.xml” file in the “news” directory and CHMOD 777

    I telnet in and “perl getap.pl” to get it all running.

    I set up my CRON to run the getap.pl once per hour (you can set yours to the desired frequency you want).

    I created a ‘shtml’ file to hold the feed results. You can see it at http://www.bvmc.org/world_news.shtml
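    The setup steps above can be sketched as shell commands; the paths are illustrative, and the crontab line is only echoed here for reference rather than installed:

    ```shell
    # create a world-writable working directory for the cached feed
    mkdir -p news
    chmod 777 news

    # seed an empty cache file the script can overwrite
    touch news/newsfeed.xml
    chmod 666 news/newsfeed.xml

    # preview the crontab entry; install it for real with `crontab -e`
    echo '0 * * * * /home/YOURPATH/getap.pl >/dev/null'
    ```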

    Good luck. Any ideas or other feedback is appreciated – especially if anything wrong is noted herein (Do not fear the reproach of men or be terrified by their insults. Isaiah 51:7).

    Jim Konicki, Webmaster, Blessed Virgin Mary of Czestochowa Church – Latham NY (www.bvmc.org)

  5. Just asking if I could run the code on IIS, because there isn’t an XML::RSS module for it. I tried to use XML::RSS::Parser but it didn’t work; any help?
    But if it works, it’s great, especially because it is in Perl.