Heal Your Church WebSite


Teaching, rebuking, correcting & training in righteous web design.

Robot Exclusion Tutorial

I received a very nice email from Carolyn at the Faith Evangelical Church in Melrose, MA. She writes:

My husband sent me this URL to improve the church website, but I feel over my head. In [point #] 7 you say. “What robot exclusion standard?” What is that?

Carolyn, as I like to say to my 11th grade Sunday school class, “there are no dumb questions, so thank you for asking … in doing so you help the rest of us learn something new.” Or put another way … ask and you shall receive!

Here is how it works. When Google or some other (legitimate/well-behaved) search engine visits your site, it first looks for a file in your root/home directory named robots.txt. This little file contains simple, line-at-a-time instructions on which directories and files you would or would not like indexed. Here is mine as an example:

User-agent: *
Disallow: /cgi-bin
Disallow: /images

The first line, “User-agent: *”, says that the rules that follow apply to ALL user agents. The next two lines specify that I don’t want search engines to index anything in my /cgi-bin or /images subdirectories, mostly because it saves me bandwidth. Now I know what you’re thinking … “what the heck is a user agent?”

A user-agent is how we identify what type of software is visiting our site, what we geeks sometimes refer to as a “client application.” For example, many of my visitors are identified as using Microsoft Internet Explorer version 6, though today, because yesterday’s article found its way to Linux.org, the majority of my visitors are using the Mozilla browser. Google identifies itself as “googlebot.”

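If I wanted, I could make an entry specific to Google that says I also don’t want it to visit or index a file on my site named “googleme.html”. Two things to watch: the path has to begin with a slash, and because a robot obeys only the most specific entry that matches it, the googlebot entry has to repeat the general rules. So I would add the following entry to my robots.txt file:

User-agent: googlebot
Disallow: /cgi-bin
Disallow: /images
Disallow: /googleme.html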
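
For those who like to check their work, here is a minimal sketch in Python that feeds rules like mine to the standard library's urllib.robotparser and asks which paths a given visitor may fetch. The ROBOTS_TXT text and the is_allowed helper are names I made up for illustration. The googlebot entry repeats the general rules on the assumption that a robot follows only the most specific entry matching it, which is how this particular parser behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt built from the rules discussed in this post.
# The googlebot entry repeats the general rules (an assumption: a robot
# that finds its own entry ignores the "*" entry).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /images

User-agent: googlebot
Disallow: /cgi-bin
Disallow: /images
Disallow: /googleme.html
"""


def is_allowed(user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for agent in ("Mozilla", "googlebot"):
        for path in ("/index.html", "/images/logo.png", "/googleme.html"):
            print(agent, path, is_allowed(agent, path))
```

Under this parser, is_allowed("googlebot", "/googleme.html") comes back False, while the same path for an ordinary browser name comes back True. Real search engines may treat edge cases differently, so take this as a quick sanity check rather than the final word.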

Of course, there are those who abuse robots.txt. Anyone can read the file, so listing a private directory in it only advertises where that directory is. So the general rule of thumb is, if none of your web pages has a link to a private directory, then don’t list it in robots.txt. Looking at it the other way around, you should only allow/disallow subdirectories that are linked from one of your web pages. For more on dealing with the abusers, may I suggest reading my March 01’03 post entitled “How to block spambots, ban spybots, and tell unwanted robots to go to …”.

I could go on, but there is a MUCH better tutorial over at SearchEngineWorld. Not only does it go into greater detail with useful real-world examples, but it is also accompanied by their Robots.txt Validator … an online program that lets you check to see if your robots.txt file is kosher.
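
To give a feel for the kind of thing a checker looks for, here is a rough sketch in Python. It is not the SearchEngineWorld validator and covers only three simple mistakes: a line with no colon, a rule that appears before any User-agent line, and a path that does not start with a slash. The lint_robots name and the wording of its messages are my own inventions.

```python
def lint_robots(text: str) -> list[str]:
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # blank lines only separate entries
        field, sep, value = line.partition(":")
        if not sep:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow"):
            if not seen_user_agent:
                problems.append(
                    f"line {number}: rule appears before any User-agent line"
                )
            # An empty Disallow value is legal and means "nothing is blocked".
            if value and not value.startswith("/"):
                problems.append(
                    f"line {number}: path '{value}' should start with '/'"
                )
    return problems


if __name__ == "__main__":
    sample = "User-agent: googlebot\nDisallow: private.html\n"
    for problem in lint_robots(sample):
        print(problem)
```

A clean file returns an empty list. Some search engines accept extensions such as wildcard paths that this sketch would flag, so a real validator is still the better judge.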

And if this doesn’t help, feel free to email me or leave a comment. I’ll just keep explaining it until everyone in the class gets it.

2 Comments

  1. Hey, I love the site. That tutorial there was actually really helpful… I’m going to have to implement that.

    I’m the senior webmaster at 7x teen ministries. I was wondering if you’d perhaps consider a link exchange?
    check it out at http://www.sev-x.com .
    Praise God, and may he bless you!

  2. I agree this was helpful. The tutorial and additional information was informative especially the “bad examples”. Back to my web site for some work.