Heal Your Church WebSite

Teaching, rebuking, correcting & training in righteous web design.

Robots.txt generator including nasty bots blocker

Not all subdirectories were made for all people; at least not to see. For example, you may not want a Google cache exposing the mistakes of a test site you’re working on. Similarly, you may have a photo album that you’d rather not find showing up on search.Yahoo.com. How do you keep various search engines from indexing these directories while still including the rest? Simple …

A long time ago, on a blog far away, we discussed the ‘Robot Exclusion Tutorial’ using just enough geek so you don’t shoot your foot clean off. Meaning, by simply using the robots exclusion standard, you can usually keep most ‘well behaved’ search engines from ‘spidering’ into semi-private directories.

Note the emphasis on ‘well behaved.’ There are some nasty-bots out there that of course look at such entries as an engraved invitation to sneak a peek. Mark Pilgrim wrote about such ne’er-do-wells, even setting up a form of a ‘honey pot’ to ‘nail the suckas.’

That said, while searching for various robots.txt validators, I came across a tool that generates a robots.txt file based upon entries you make, along with an option to include ‘nasty bots.’ Judging from the prompts on the page, though, I think the webmaster needs to get two things straight.

First, you’re not really ‘blocking’ anything; you’re requesting that a search engine not traverse a stated path. Second, you don’t make an entry like ‘www.yoursite.com/private/’ but rather ‘/private’.
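Here is a rough sketch of why that second point matters, using Python's standard `urllib.robotparser` as a stand-in for a well-behaved crawler. The helper name `is_allowed`, the bot name `ExampleBot`, and the page URLs are made up for illustration; only the two entries, ‘/private’ and ‘www.yoursite.com/private/’, come from the text above. Real search engines use their own parsers, so treat this as an approximation of how a compliant bot reads the file, not a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Entry written the right way: a path relative to the site root.
GOOD_ROBOTS_TXT = """\
User-agent: *
Disallow: /private
"""

# Entry written the wrong way: host name included in the path.
BAD_ROBOTS_TXT = """\
User-agent: *
Disallow: www.yoursite.com/private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant bot would consider `url` fetchable.

    Hypothetical helper: parses robots.txt text directly instead of
    downloading it, so it can be run without a live site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    private_page = "http://www.yoursite.com/private/notes.html"  # made-up URL
    public_page = "http://www.yoursite.com/index.html"           # made-up URL

    for label, text in (("good", GOOD_ROBOTS_TXT), ("bad", BAD_ROBOTS_TXT)):
        for url in (private_page, public_page):
            print(label, url, is_allowed(text, "ExampleBot", url))
```

With the ‘/private’ entry, this parser reports the private page as off limits and the home page as fine. With the host name in the entry, the rule never matches any request path, so the private page is treated as fair game. Either way, nothing is actually blocked; the parser is only answering the question a polite bot asks itself.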

Once you get past that, the robots.txt generator tool works as advertised. When you’re done there, you may also want to visit my post listing posts on blocking sites using mod_rewrite via .htaccess and other such fun.

Too bad we just can’t send all the spammers on a rocket ship to the sun or something.


  1. Remember that anything you want to be really private, and that isn’t linked TO from elsewhere, should NOT be included in your ROBOTS.TXT files.

    Often ROBOTS.TXT is read as a way to find out what people *don’t* want you looking into, so it can give away otherwise-private areas.

    Don’t be too trusting. ROBOTS.TXT helps keep good guys (or their software) out of your stuff, but it can lead the less-good guys right to it.

  2. Amen to the above post. That is exactly what I was about to say. If I don’t want someone to get into something, it needs a .htaccess password on it so engines OR people cannot get into it.

    Plus, hackbots just love indexing no-no areas, I am sure. ;)

  3. Note that robots.txt is just a “suggestion” to an indexing bot; it may or may not actually obey your suggestion. And if non-compliant bot A indexes your content, and Google indexes it off their page, you’ll end up in Google anyway.

  4. We get a surprising (to me) amount of traffic from image searches. For example, “fiddler on the roof photos” is regularly among the top 10 queries from Google. I suppose this is just overhead, since these people are interested in Fiddler on the Roof, not our church. I’ve toyed with the idea of banning image spiders from some directories, but have decided not to. We do not even approach our monthly bandwidth limit. Short of requiring a login to see the photos, I don’t think there is a reliable way to prevent the photos from “escaping” into various search engines. If you do put photos on your site, you should have a policy (to protect the privacy of your members) that your church is comfortable with.

  5. Don’t forget open directories. If you have folder listing enabled, so that someone playing with URLs at your site can visit http://www.mysite.com/images/ and see a list, consider putting in an index.htm or default.htm (or whatever default filename your server is set for) with either blank contents or a jump back to the homepage. That way you can keep people, and bots, from seeing the listings of those open directories.

    … or disable the folder listings of course… though there are times where you’d like that on some places but not others…
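
To make the first comment's warning concrete: robots.txt is a public file, so anyone can read the list of places you asked bots to stay out of. The sketch below is only an illustration; the function name `disallowed_paths` and the sample paths other than ‘/private’ are invented, and it does plain text parsing rather than following the full exclusion standard.

```python
from typing import List

# Hypothetical robots.txt contents; '/test-site/' and '/photos/' are made up.
SAMPLE_ROBOTS_TXT = """\
# keep search engines out of these
User-agent: *
Disallow: /private
Disallow: /test-site/   # work in progress
Disallow: /photos/
Disallow:
"""


def disallowed_paths(robots_txt: str) -> List[str]:
    """Collect every non-empty Disallow value, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "disallow":
            value = value.strip()
            if value:  # an empty Disallow means "nothing is off limits"
                paths.append(value)
    return paths


if __name__ == "__main__":
    # Anything listed here is a hint to a nosy visitor, not a lock.
    for path in disallowed_paths(SAMPLE_ROBOTS_TXT):
        print(path)
```

A dozen lines is all it takes to turn your exclusion list into a to-do list for a badly behaved bot, which is why anything truly private belongs behind a password rather than in robots.txt.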