How to block spambots by user agent using .htaccess

2008 May 27
by MeanDean

Spambots and spiders that ignore the robots exclusion file (robots.txt) can kill your site, both by eating bandwidth and by potentially exposing information you don’t want ‘harvested.’ With that in mind, here is a quick-n-dirty guide to blocking spambots and rogue search engine spiders using .htaccess. First the essential example codeblock, followed by a working example:

essential example codeblock

# redirect spambots & rogue spiders to the end of the internet
Options +FollowSymLinks
RewriteEngine On
RewriteBase /
# match any user agent that begins with "spambot"
RewriteCond %{HTTP_USER_AGENT} ^spambot
# NE keeps the "#" in the target from being escaped to %23
RewriteRule ^(.*)$ http://www.shibumi.org/eoti.htm#$1 [R=301,NE,L]

Next, read my article on how to quickly check your error logs for oddities … which should provide you with a list of all sorts of unusual user agents worth blocking.

With said list in hand, all that is left to do is create a working version that, instead of sending visitors to the end of the internet, blocks them outright with a 403 Forbidden response - which is probably a better move than sending the traffic elsewhere:

real-world/working example

# block spambots & rogue spiders outright with a 403 Forbidden
Options +FollowSymLinks
RewriteEngine On
RewriteBase /
# each condition is joined to the next with [OR]; the last one has no flag
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector
# "-" means no substitution; F returns 403 Forbidden
RewriteRule .* - [F,L]

Note that I provide 4 examples:

  1. ^$
  2. ^EmailSearch
  3. ^Microsoft\ URL
  4. ^Web\ Image\ Collector

All to demonstrate how to use perl-like regular expressions to parse the user agent. For example:

  1. ^ - identifies the beginning of the user agent string
  2. $ - identifies the end of the user agent string
  3. \ - that is a backslash followed by a space; it tells the parser to treat the space between words as part of the pattern rather than as the end of it
  4. [OR] - is placed after each of the multiple entries, except the last; without it, consecutive conditions must all match
  5. [NC,...] - "no case"; is sometimes placed after an entry to match it without regard to upper or lower case
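
To show the [OR] and [NC] flags working together, here is a minimal sketch of the same block written more compactly. The grouping with ( | ) is my own illustration rather than part of the working example above, and it assumes mod_rewrite is enabled and that your host lets you use rewrite rules in .htaccess:

```apache
# Sketch only: the three named agents folded into one case-insensitive condition.
RewriteEngine On
# Empty user agent: nothing between the start (^) and the end ($)
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
# ( | ) lists alternatives; "\ " keeps each space inside the pattern;
# [NC] makes "emailsearch" match as well as "EmailSearch"
RewriteCond %{HTTP_USER_AGENT} ^(EmailSearch|Microsoft\ URL|Web\ Image\ Collector) [NC]
# "-" leaves the URL alone; F answers with 403 Forbidden
RewriteRule .* - [F]
```

A single grouped condition is easier to keep tidy as the list grows, while one agent per line, as in the working example, is easier to comment out when something legitimate gets caught.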

In the process, I’m intentionally blocking empty user agents using .htaccess - “^$” - a search string that uses a regular expression to test for nothing between the beginning “^” and end “$” of the user agent string. Sorry, but if you’re not willing to tell me who/what you are, I’m not willing to show you my content.
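
Be aware that some legitimate services send no user agent at all. If one of them needs to reach a specific file, one possible approach is to exempt that file from the empty-agent rule. This is a hedged sketch: the path /verify-me.html is a made-up placeholder, not something from my own setup, so substitute whatever file the service actually requests:

```apache
# Sketch only: block empty user agents everywhere except one file.
# /verify-me.html is a hypothetical placeholder path.
RewriteEngine On
# Condition 1: the user agent is empty
RewriteCond %{HTTP_USER_AGENT} ^$
# Condition 2 (no [OR], so both must hold): the request is NOT for the exempt file
RewriteCond %{REQUEST_URI} !^/verify-me\.html$
RewriteRule .* - [F]
```

Because there is no [OR] between the two conditions, the rule only fires when both are true, so the exempted file stays reachable while everything else is still refused.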

Also, be aware the above requires that you have mod_rewrite installed on your Apache server, and that you have privileges to create your own rewrite rules in your own .htaccess file. If you’re not sure, check with your hosting service and/or system administrator.

In most cases, such privileges and access exist - but your mileage may vary, as may how your particular .htaccess file actually behaves in-the-wild.
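
If you are not sure whether mod_rewrite is available, one common precaution is to wrap the rules in an IfModule block. This is a sketch under that assumption, not a required part of the examples above:

```apache
# Sketch only: apply the rules only when mod_rewrite is loaded.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSearch
    RewriteRule .* - [F]
</IfModule>
```

The trade-off: without the wrapper, a missing mod_rewrite typically produces a 500 error on every request, which at least tells you something is wrong. With the wrapper the site keeps working, but the block is silently skipped, so check that it is actually taking effect.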

That said, more tomorrow or Thursday on how to create a cron job to list those “unusual user agents” ‘automagically’ for easy identification and, if needed, anti-spam remediation.

Comments
2008 July 19

I’ve come across a handful of services that for whatever reason don’t send a User-agent. The most recent one was when I set up a new Google Analytics account: the validator from Google doesn’t send a user-agent. So I had to disable my htaccess rules temporarily until the site was “verified.”

2008 September 20

As a webmaster, you definitely should use user-agent headers to manage server traffic. But understand that this is purely a pragmatic tactic and not a serious security measure.

I wrote more about this here:

Webmaster Tips: Blocking Selected User-Agents
http://faseidl.com/public/item/213126
