Heal Your Church WebSite


Teaching, rebuking, correcting & training in righteous web design.

find-a-bot.sh – a nice little script to ID bots bugging your website

Earlier this week I demonstrated how to block spambots and rogue spiders. Today I’m completing the lesson with a nice little bash script sample that can help you identify some of these non-browser ‘candidates’ by parsing your access logs and placing the results in an easy-to-read text file.

In other words, this script will selectively find most non-browser user agents that appear in your access logs like this:

24.190.239.220 - - [29/May/2008:05:16:19 -0700] "GET /about HTTP/1.1" 200 628 "-" "Java/1.6.0_06"
79.71.205.134 - - [29/May/2008:00:56:34 -0700] "GET / HTTP/1.1" 200 12888 "-" "Site Sniper Pro"

And turn them into slightly saner, sorted output like this:

24.190.239.220 [29/May/2008:05:16:19 "Java/1.6.0_06"
79.71.205.134 [29/May/2008:00:56:34 "Site Sniper Pro"

Here is what your bash script might look like on a site running WordPress on a shared host like DreamHost ... I'll explain some of the mechanics afterwards:

#!/bin/bash
#
# step 1 - modify these so you get paths like this:
#   /home/YOURROOT/YOURDOMAIN.COM/...
#
myroot="YOURROOT"
mydomain="YOURDOMAIN.COM"

#
# step 2 - leave alone if these days & formats work for you:
#
TERM=linux
export TERM
tdy=`date +%d%b%y`
ydy=`date -d '1 day ago' +%Y-%m-%d`
dby=`date -d '7 day ago' +%Y-%m-%d`
logfile="access.log.$ydy"

#
# step 3 - modify if you're using something other
#           than  WordPress on DreamHost
#
outfile="/home/$myroot/$mydomain/findabot"
logpath="/home/$myroot/logs/$mydomain/http/"
csspath="/home/$myroot/$mydomain/wp-content"

#
# step 4 - mother of all parsing statements, parse to taste
#	(note this version DOES sort)
#
# 	remember \ at the very end of line equals
#	bash line continuation of a command set
#
grep "$csspath" -v $logpath$logfile | \
  egrep " \"(Mozilla|Opera)\/[0-9]| \"BlackBerry[0-9]{4}" -v | \
  perl -l -a -n -e 'print $F[0]," ",$F[3]," ",$F[11]," ",$F[12]," ",$F[13]' | \
  sort -n > $outfile/$ydy.txt

#
# step 5 - maintain a manageable archive
#
if [ -e "$outfile/$dby.txt" ]; then
	mv -f "$outfile/$dby.txt" "$outfile/bak.txt"
fi

Okay, step 1 basically means you log in to your site via SSH (or even FTP) and, before navigating anywhere, issue the “pwd” command so you can determine YOURROOT and YOURDOMAIN (though the latter will likely just be your website’s URL).
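For instance, on a DreamHost-style shared host, a fresh SSH session usually lands you in your home directory; the login name below is purely hypothetical:

$ ssh exampleuser@example.com
$ pwd
/home/exampleuser

In that case myroot="exampleuser", and mydomain is whatever domain directory sits beneath it (e.g. /home/exampleuser/example.com).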

Step 2 is how we get date stamps for our input and output files. I found a nice, simple example of formatting these date variables over in an ExpressionEngine manual – but they’ll work in your bash script just fine.

Also, the line containing “7 day ago” can be modified to indicate how many days’ worth of logs you want to keep active. Similarly, the prior line containing “1 day ago” means you want to parse yesterday’s logs.
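If you’re curious what those date commands actually spit out, you can run them by hand. Assuming today were 30 May 2008, you’d see something like:

$ date +%d%b%y
30May08
$ date -d '1 day ago' +%Y-%m-%d
2008-05-29
$ date -d '7 day ago' +%Y-%m-%d
2008-05-23

That middle value is what gets glued onto access.log to form yesterday’s log file name.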

Step 3 is basically how I use variables to define file and directory paths based on what I coded for steps 1 and 2.
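Plugging in the hypothetical values from step 1, those three paths would expand to something like:

outfile: /home/exampleuser/example.com/findabot
logpath: /home/exampleuser/logs/example.com/http/
csspath: /home/exampleuser/example.com/wp-content

Note that findabot is a directory the script writes into, so create it (mkdir) before the first run.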

Step 4 combines all the elements from the steps above and, taking a page out of my April 2nd article entitled ‘How to quickly check your error logs for oddities’, issues a consecutive stream of grep and/or egrep commands.

Sometimes I leverage the ‘-v’ option to exclude matching lines, most notably when I’m filtering out known browser user agent strings.

This done, a bit of Perl command-line magic is used to parse out the fields we want, after which the selected data is sorted and piped into the output file defined in step 3.
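If you want to see what that Perl one-liner is doing before unleashing the whole pipeline, feed it one of the sample log lines from above; the -a switch auto-splits each line on whitespace into the @F array, so we’re simply printing fields 0, 3 and 11 through 13:

echo '24.190.239.220 - - [29/May/2008:05:16:19 -0700] "GET /about HTTP/1.1" 200 628 "-" "Java/1.6.0_06"' | perl -l -a -n -e 'print $F[0]," ",$F[3]," ",$F[11]," ",$F[12]," ",$F[13]'
24.190.239.220 [29/May/2008:05:16:19 "Java/1.6.0_06"

Fields 12 and 13 only come into play when a user agent string contains spaces, as with “Site Sniper Pro” in the earlier example.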

Step 5 takes into account that logs can get big, so this is where we manage an archive … based on step 2 … for 7 days’ worth of entries.
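If you’d rather keep a rolling week of reports instead of a single bak.txt, one alternative (assuming the same $outfile variable from above) is to let find prune anything older than seven days:

# optional: delete daily reports older than a week instead of keeping bak.txt
find "$outfile" -name '*.txt' -mtime +7 -exec rm -f {} \;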

If you’re not familiar with creating bash scripts, you may encounter situations where you need to “chmod” or even “chown” the file to get it to work.
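In most cases that’s just a matter of making the script executable once it’s uploaded, along the lines of:

chmod 755 find-a-bot.sh
# or, if only your own user should be able to run it:
chmod 700 find-a-bot.sh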

The next step – though not documented above – is to test the script and, when you’re sure it’s working, modify your crontab file so your batch runs every night, say at 2:15 AM, while you and everyone else are sleeping. Here’s what my crontab entry looks like:

15 2 * * * /home/YOURROOT/find-a-bot.sh > /dev/null
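If you’ve never touched cron before, that line goes into your crontab via:

crontab -e    # opens your crontab in an editor; paste the line above and save
crontab -l    # lists what's currently scheduled, so you can double-check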

I’ve provided a .txt version of the file you can simply download from here.

Moreover, I’ve created a slightly more complex version of the above to download for use on a system running something like vBulletin on a root or virtual private server running Fedora or Red Hat.

The point is, while the above appears a bit complex, I can assure you it’s worth running as it can help you quickly discern over the course of a few days:

  • how often and how hard spambots are sniffing your system (a couple of one-liners for tallying this appear just after this list)
  • how much of your bandwidth is consumed by feed readers versus browsers
  • which feed readers are hammering away at your site, ignoring your <skipHours /> and/or <skipDays /> data
  • how much bandwidth you might save by exporting your sermon’s RSS feeds to a service like FeedBurner
  • what spiders are ignoring your robots.txt file
  • tips on unusual visitors from interesting places, gleaned from their unique user agents
  • whether or not some of the comment spam comes via “Mozilla-like” agents who botch their user agent string
  • how many of your visitors are infected with spyware
  • how many of your visitors are trying to hide their tracks by visiting you with an anonymous proxy firing blank user agent strings
  • how many spamblogs are leeching your compelling content
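Once a few days of these files pile up, a couple of quick one-liners against them will answer several of the questions above; the paths below assume the hypothetical values used earlier:

# tally hits per user agent, busiest first
cut -d' ' -f3- /home/exampleuser/example.com/findabot/*.txt | sort | uniq -c | sort -rn | head -20

# spot visitors firing blank user agent strings
grep ' "-"' /home/exampleuser/example.com/findabot/*.txt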

Like I said, it will require just a little bash scripting know-how, so with that, I leave you with these tutorials:

Oh and if you’re nice and leave a comment, I might even email you a link to my own archive of greatest bot hits over the past few days.

Especially if you share your own scripting recipes for spotting bots.
