Heal Your Church WebSite


Teaching, rebuking, correcting & training in righteous web design.

How to block spambots by user agent using .htaccess

Spambots and spiders that ignore your robots exclusion file can kill your site, both in bandwidth and by potentially exposing information you don’t want ‘harvested.’ With that in mind, here is a quick-n-dirty guide to blocking spambots and rogue search engine spiders using .htaccess. First the essential example codeblock, followed by a working example:

essential example codeblock

# redirect spambots & rogue spiders to the end of the internet
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
# match any request whose user agent string begins with "spambot"
RewriteCond %{HTTP_USER_AGENT} ^spambot
# permanently redirect, appending the originally requested path after the #
RewriteRule ^(.*)$ http://www.shibumi.org/eoti.htm#$1 [R=301,L]

Next is to read my article on how to quickly check your error logs for oddities … which should provide you with a list of all sorts of unusual user agents worth blocking.

With said list, all that is left to do is create a working version that, instead of sending visitors to the end of the internet, blocks them outright – which is probably a better move than sending the traffic elsewhere:

real-world/working example

# block spambots & rogue spiders outright with a 403 Forbidden
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
# an empty user agent string
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector
# [F] returns 403 Forbidden; no redirect target is needed
RewriteRule .* - [F,L]

Note I provide 4 examples:

  1. ^$
  2. ^EmailSearch
  3. ^Microsoft\ URL
  4. ^Web\ Image\ Collector

All to demonstrate how to use Perl-like regular expressions to parse out the user agent. For example:

  1. ^ – matches the beginning of the user agent string
  2. $ – matches the end of the user agent string
  3. \ – a backslash followed by a space tells the parser to treat the space between words as a literal character
  4. [OR] – is placed after each of the multiple entries, except the last
  5. [NC,...] – is sometimes placed after an entry to match it without regard to upper or lower case (see the sketch below)

In the process, I’m intentionally blocking empty user agents using .htaccess – “^$” – a search string that uses a regular expression to test for nothing between the beginning “^” and end “$” of a user agent token. Sorry, but if you’re not willing to tell me who/what you are, I’m not willing to show you my content.
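To see the [NC] flag in context, here is a minimal sketch that chains an empty user agent test with a case-insensitive match. Note the agent name “EmailCollector” is a hypothetical stand-in, not one pulled from my logs – substitute whatever shows up in yours:

# minimal sketch: deny empty user agents and any agent beginning
# with "EmailCollector" regardless of case ("EmailCollector" is a
# hypothetical name, swap in agents found in your own logs)
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^emailcollector [NC]
RewriteRule .* - [F,L]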

Also, be aware that the above requires mod_rewrite to be enabled on your Apache server, and that you have privileges to create your own rewrite rules in your own .htaccess file. If you’re not sure, check with your hosting service and/or system administrator.

In most cases, such privs & access exist – but your mileage may vary, both in what you’re allowed to do and in how your particular .htaccess file actually behaves in-the-wild.
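If you’re not sure whether mod_rewrite is loaded, one defensive approach – my suggestion, not a requirement of the rules above – is to wrap the directives in an <IfModule> container so Apache silently skips them when the module is missing, instead of answering every request with a 500 error. A minimal sketch:

# skip the whole block gracefully if mod_rewrite isn't loaded
<IfModule mod_rewrite.c>
  Options +FollowSymlinks
  RewriteEngine On
  RewriteBase /
  RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector
  RewriteRule .* - [F,L]
</IfModule>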

That said, more tomorrow or Thursday on how to create a cron job that lists those “unusual user agents” ‘automagically’ for easy identification and, if needed, anti-spam remediation.

6 Comments

  1. Pingback: r and s are regular expressions and

  2. Pingback: how to use microsoft works 2007 tutorials

  3. Pingback: linux script cron job

  4. I’ve come across a handful of services that for whatever reason don’t send a User-agent. The most recent one was when I set up a new Google Analytics account; the validator from Google doesn’t send a user-agent, so I had to disable my .htaccess rules temporarily until the site was “verified.”

  5. Pingback: Bookmarks about Cron

  6. As a webmaster, you definitely should use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.

    I wrote more about this here:

    Webmaster Tips: Blocking Selected User-Agents
    http://faseidl.com/public/item/213126