The other day I had a major problem with one of my sites: when I checked the stats, bandwidth usage had hit over 2 gigabytes in a month. After some research I found that one of the AltaVista web crawlers, or robots, was checking each and every one of the pages on my site, including a script I was using to let users search with Google. Not a good thing. As soon as I found this out I read up on robots.txt, the file that helps control crawlers, or robots as I will refer to them from this point onwards.
Before you get your site listed on all those search engines, stop and think for a moment. Do you want ALL your pages indexed? Search engines are great. You only have to submit a single URL to a search engine and, once it knows your site exists, it will automatically crawl all your other pages using a robot and add them to its index within a few weeks. However, this invasiveness can cause problems if you do not want certain pages indexed, as I found out when checking my stats. For example, do you want your order forms, customised error pages, confirmation pages etc. listed on the search engines? Probably not. So what can you do to prevent it?
There are two techniques. One is to include a special META tag in each page you don't want indexed. The other is to create a robots text file. We will deal with the robots text file this time around and leave the META tags for another day.
Robot Text Files - robots.txt
The first thing a robot does when it visits your site is look for a file called "robots.txt". If the file exists, the robot will follow the instructions contained within it. If there is no robots.txt file present, you are giving it free rein to examine any page it wishes.
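As a rough illustration of where a robot looks, the sketch below works out the robots.txt address from the address of any page on a site. The function name robots_url and the page address are made up for this example; the only assumption is that the file always sits at the top level of the host.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page on the example site used in this article.
print(robots_url("http://www.yoursite.com/forms/order.html"))
# prints http://www.yoursite.com/robots.txt
```

Whatever page the robot starts from, it ends up asking for the same single file, which is why one robots.txt covers the whole site.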
By including a robots.txt file you can indicate exactly what is, and what is not, off-limits to all robots or just some of them. Everything not explicitly disallowed is considered fair game to retrieve, so what you want to exclude depends on your server. The '*' in the User-agent field is a special value meaning "any robot". Note also that regular expressions are not supported in either the User-agent or Disallow lines; specifically, you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif". Use your favorite text editor and set the file out like this:
--------- CUT -----------
# robots.txt file for http://www.yoursite.com
# This character signifies a comment tag

User-agent: webcrawler
Disallow:

User-agent: altavista
Disallow: /

User-agent: *
Disallow: /forms
Disallow: /logs
--------- CUT -----------
Any line starting with '#' specifies a comment. Use it for your own information.
The first paragraph after the comments is specific to the robot called 'webcrawler' and states that webcrawler has nothing disallowed so it is free to go anywhere.
The second paragraph indicates that the robot called 'altavista' is effectively barred from your entire site.
The last paragraph indicates that all other visiting robots should not visit URLs starting with /forms or /logs. The '*' is not a wildcard but a special value meaning "any robot". You cannot use wildcard patterns or other expressions in the User-agent or Disallow fields.
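If you want to check how a robot might read the example above, here is a minimal sketch using Python's standard urllib.robotparser module. The robot name "otherbot" and the page paths are made up for illustration, and the records are separated by blank lines. Real robots may handle edge cases differently from this parser, so treat it as a sanity check rather than the final word.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
# robots.txt file for http://www.yoursite.com
User-agent: webcrawler
Disallow:

User-agent: altavista
Disallow: /

User-agent: *
Disallow: /forms
Disallow: /logs
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(EXAMPLE)

# (robot name, path) pairs; "otherbot" and the paths are hypothetical.
checks = [
    ("webcrawler", "/forms/order.html"),
    ("altavista", "/index.html"),
    ("otherbot", "/forms/order.html"),
    ("otherbot", "/logs/access.log"),
    ("otherbot", "/index.html"),
]
for robot, path in checks:
    verdict = "allowed" if rules.can_fetch(robot, path) else "blocked"
    print(f"{robot:<10} {path:<20} {verdict}")
```

With this parser, webcrawler is allowed everywhere, altavista is blocked everywhere, and any other robot is blocked only from paths starting with /forms or /logs.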
You also cannot string lines together like this:
--------- CUT -----------
User-agent: *
Disallow: /forms /logs /errors /tmp
--------- CUT -----------
You must create a new Disallow line for each entry like this:
--------- CUT -----------
User-agent: *
Disallow: /forms
Disallow: /logs
Disallow: /errors
Disallow: /tmp
--------- CUT -----------
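To see why the one-path-per-line rule matters, the sketch below feeds both versions to Python's standard urllib.robotparser. The helper name blocked, the robot name "ExampleBot" and the path /logs/access.log are assumptions made for this example. This shows how one particular parser behaves; other robots may treat the malformed line differently, but none is obliged to split it into four paths.

```python
from urllib.robotparser import RobotFileParser

COMBINED = """\
User-agent: *
Disallow: /forms /logs /errors /tmp
"""

SEPARATE = """\
User-agent: *
Disallow: /forms
Disallow: /logs
Disallow: /errors
Disallow: /tmp
"""


def blocked(robots_txt: str, path: str, robot: str = "ExampleBot") -> bool:
    """Return True if the given robots.txt text keeps robot away from path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(robot, path)


# The combined line is read as ONE odd path, so /logs is not protected.
print(blocked(COMBINED, "/logs/access.log"))  # prints False
# With one Disallow line per path, /logs is protected as intended.
print(blocked(SEPARATE, "/logs/access.log"))  # prints True
```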
More Examples
To exclude all robots from the entire server
--------- CUT -----------
User-agent: *
Disallow: /
--------- CUT -----------
To allow all robots complete access
--------- CUT -----------
User-agent: *
Disallow:
--------- CUT -----------
Or create an empty "/robots.txt" file.
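The two extremes above, plus the empty file, can be compared with a short sketch, again using Python's standard urllib.robotparser. The helper name can_visit, the robot name "ExampleBot" and the path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

DENY_ALL = "User-agent: *\nDisallow: /\n"
ALLOW_ALL = "User-agent: *\nDisallow:\n"
EMPTY = ""  # an empty robots.txt file


def can_visit(robots_txt: str, path: str, robot: str = "ExampleBot") -> bool:
    """Return True if robot may fetch path under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot, path)


for label, text in [("deny all", DENY_ALL), ("allow all", ALLOW_ALL), ("empty", EMPTY)]:
    print(label, can_visit(text, "/any/page.html"))
```

With this parser, an empty Disallow value and an empty file both leave the whole site open, while "Disallow: /" closes it.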
To exclude all robots from part of the server
--------- CUT -----------
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
--------- CUT -----------
To exclude a single robot
--------- CUT -----------
User-agent: BadBot
Disallow: /
--------- CUT -----------
To allow a single robot
--------- CUT -----------
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
--------- CUT -----------
To exclude all files except one. This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:
--------- CUT -----------
User-agent: *
Disallow: /~joe/docs/
--------- CUT -----------
Alternatively you can explicitly disallow all disallowed pages:
--------- CUT -----------
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
--------- CUT -----------
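If the list of pages to hide is long, writing one Disallow line per file by hand gets tedious. Here is a minimal sketch that builds the record from a list of file names and skips the one page you want to keep. The function name disallow_all_except and the file index.html are hypothetical; the other names come from the example above.

```python
def disallow_all_except(directory: str, filenames: list[str], keep: str) -> str:
    """Build a robots.txt record disallowing every listed file except keep."""
    lines = ["User-agent: *"]
    for name in sorted(filenames):
        if name != keep:
            # One Disallow line per file; paths cannot share a line.
            lines.append(f"Disallow: {directory}{name}")
    return "\n".join(lines) + "\n"


# index.html is a made-up name for the one page that stays visible.
files = ["private.html", "foo.html", "bar.html", "index.html"]
print(disallow_all_except("/~joe/", files, keep="index.html"))
```

You would need to re-run something like this whenever files are added, which is one reason the separate-directory approach is the easier of the two.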
The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or the page "/foo.html":
--------- CUT -----------
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
--------- CUT -----------
This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
--------- CUT -----------
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
--------- CUT -----------
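The cybermapper example can be checked in the same way. The sketch below uses Python's standard urllib.robotparser; the robot name "ExampleBot" and the page paths are made up for illustration. Note that in this parser a record naming a specific robot wins over the '*' record even though it appears later in the file, and trailing '#' comments on a Disallow line are ignored.

```python
from urllib.robotparser import RobotFileParser

EXAMPLE = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""

parser = RobotFileParser()
parser.parse(EXAMPLE.splitlines())

# Hypothetical page inside the restricted area.
page = "/cyberworld/map/index.html"
print(parser.can_fetch("cybermapper", page))  # prints True
print(parser.can_fetch("ExampleBot", page))   # prints False
# Outside /cyberworld/map/ every robot is still welcome.
print(parser.can_fetch("ExampleBot", "/cyberworld/about.html"))  # prints True
```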
Once you are happy, save the file as "robots.txt" and move it to the root directory of your site, i.e. where your default page resides.
NOTE: Since the Robots Exclusion Protocol is not honoured by all web robot authors, it is not possible to stop all robots from wandering your site. However, the good news is that the majority of the well-known search engines and tools support this protocol. Refer to their documentation to verify this.
Tips
1) Use the robots.txt file if possible
2) Use the ROBOTS META tag if you can't create the above file. It's okay to use both methods if possible.
3) If you know which robots you're trying to prevent from indexing your pages, a particular search engine for example, go to the source of the robot and remove your page if possible. In other words, many search engines provide ways for you to remove your URLs from their indexes without having to use any of the above methods.
4) Make your page stand-alone if possible. That is, remove links to the page that you're trying to keep away from robots. The more links there are to your page, the easier it is for a search engine robot to find it. If your page is already in search engine indexes, it's too late to take this preventative step.
5) If you must have absolute protection from robots, password protect the pages in question. Since all other methods are "agreements" that both parties must honour in order for them to work in full, preventing the page from being served is the only way to guarantee that robots will not be able to touch your pages.
Just follow these simple rules and you should have no problems. If you want to verify your robots.txt, check it at http://www.tardis.ed.ac.uk/~sxw/robots/check/
Let us know if this article helped you! Thanks!
LearningLinux Webmaster