Linux HowTOs               February 04th, 2012   

      robots.txt - Excluding Pages From Search Engines
      Posted by: LearningLinux
    The other day I had a major problem with one of my sites: when I checked the stats, bandwidth usage had hit over 2 gigs in a month. After some digging I found that one of the AltaVista web crawlers was fetching each and every one of the pages on my site, including a script I was using to let users search with Google. Not a good thing. As soon as I found this out I did some research on robots.txt, the file that helps control crawlers, or robots as I will refer to them from this point onwards.

    Before you get your site listed on all those search engines, stop and think for a moment. Do you want ALL your pages indexed? Search engines are great. You only have to submit a single URL to a search engine and, once it knows your site exists, its robot will automatically crawl all your other pages and add them to its index within a few weeks. However, this invasiveness can cause problems if you do not want certain pages indexed, as I found out when checking my stats. For example, do you want your order forms, customised error pages, confirmation pages etc. listed on the search engines? Probably not. So what can you do to prevent it?

    There are two techniques. One is a special META tag that you include in each page you don't want indexed. The other is a robots.txt file. We will deal with the robots.txt file this time around and leave the META tags for another day.

    Robot Text Files - robots.txt

    The first thing a robot does when it visits your site is to look for a file called "robots.txt". If the file exists, the robot will follow the instructions contained within it. If there is no robots.txt file present, you are giving it free rein to examine any page it wishes.
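
    As a rough illustration of where a robot looks, the sketch below works out the robots.txt address for any page address. The function name robots_url is made up for this example; the code uses only Python's standard urllib.parse module.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt address for the site that serves page_url.

    robots_url is a hypothetical helper name, not part of any library.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.yoursite.com/forms/order.html?id=7"))
```

    Whatever page is asked about, the answer always points at the top level of the site, which is why the file has to live in your root directory.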

    By including a robots.txt file you can indicate exactly what is, and what is not, off-limits to all robots or just to some. Everything not explicitly disallowed is considered fair game to retrieve. Note that regular expressions are not supported in either the User-agent or Disallow lines. Specifically, under the original standard you cannot have lines like "Disallow: /tmp/*" or "Disallow: *.gif". The '*' in the User-agent field is a special value meaning "any robot". What you want to exclude depends on your server. Use your favorite text editor and set it out like this:
    --------- CUT -----------
    # robots.txt file for http://www.yoursite.com
    # This character signifies a comment tag
    
    User-agent: webcrawler
    Disallow:
    
    User-agent: altavista
    Disallow: /
    
    User-agent: *
    Disallow: /forms
    Disallow: /logs
                    
    --------- CUT -----------
    
    Any line starting with '#' specifies a comment. Use it for your own information.

    The first paragraph after the comments is specific to the robot called 'webcrawler' and states that webcrawler has nothing disallowed so it is free to go anywhere.

    The second paragraph indicates that the robot called 'altavista' is effectively barred from your entire site.

    The last paragraph indicates that all other visiting robots should not visit URLs starting with /forms or /logs. The '*' is not a wildcard but a special value meaning "any robot". You cannot use wildcard patterns or other expressions in the User-agent or Disallow fields.
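
    To see how a robot might read the example file, here is a minimal sketch using Python's standard urllib.robotparser module. The helper name allowed and the robot name "SomeOtherBot" are invented for this illustration, and real robots may interpret edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, without the CUT markers.
EXAMPLE = """\
# robots.txt file for http://www.yoursite.com

User-agent: webcrawler
Disallow:

User-agent: altavista
Disallow: /

User-agent: *
Disallow: /forms
Disallow: /logs
"""


def allowed(robots_txt, agent, path):
    """Report whether Python's parser lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for agent, path in [
    ("webcrawler", "/forms/order.html"),
    ("altavista", "/index.html"),
    ("SomeOtherBot", "/index.html"),
    ("SomeOtherBot", "/forms/order.html"),
    ("SomeOtherBot", "/logs/access.log"),
]:
    print(agent, path, allowed(EXAMPLE, agent, path))
```

    With Python's parser, webcrawler is let in everywhere, altavista is refused everywhere, and any other robot is refused only for paths that begin with /forms or /logs.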

    You also cannot string lines together like this:
    --------- CUT -----------
    User-agent: *
    Disallow: /forms /logs /errors /tmp
    
    --------- CUT -----------
    

    You must create a new Disallow line for each entry like this:
    --------- CUT -----------
    User-agent: *
    Disallow: /forms
    Disallow: /logs
    Disallow: /errors
    Disallow: /tmp
    
    --------- CUT -----------
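
    The difference between the two forms can be checked with a short sketch. It assumes Python's standard urllib.robotparser behaves like a typical robot here, which is not guaranteed; build_record, blocks and "AnyBot" are names made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_record(agent, paths):
    """Build one robots.txt record with a separate Disallow line per path."""
    lines = ["User-agent: " + agent]
    lines += ["Disallow: " + path for path in paths]
    return "\n".join(lines) + "\n"


def blocks(robots_txt, path, agent="AnyBot"):
    """Return True if Python's parser refuses `path` for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, path)


# The wrong form: several paths strung together on one line.
strung_together = "User-agent: *\nDisallow: /forms /logs /errors /tmp\n"
# The right form: one Disallow line per path.
one_per_line = build_record("*", ["/forms", "/logs", "/errors", "/tmp"])

print(blocks(strung_together, "/logs/access.log"))
print(blocks(one_per_line, "/logs/access.log"))
```

    Python's parser treats the strung-together value as one long path that matches nothing on the site, so /logs stays open; only the one-per-line version closes it.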
    


    More Examples

    To exclude all robots from the entire server
    --------- CUT -----------
    User-agent: *
    Disallow: /
    
    --------- CUT -----------
    

    To allow all robots complete access
    --------- CUT -----------
    User-agent: *
    Disallow:
    
    --------- CUT -----------
    
    Or create an empty "/robots.txt" file.

    To exclude all robots from part of the server
    --------- CUT -----------
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/
    Disallow: /private/
    
    --------- CUT -----------
    

    To exclude a single robot
    --------- CUT -----------
    User-agent: BadBot
    Disallow: /
    
    --------- CUT -----------
    

    To allow a single robot
    --------- CUT -----------
    User-agent: WebCrawler
    Disallow:
    
    User-agent: *
    Disallow: /
    
    --------- CUT -----------
    

    To exclude all files except one. This is a bit awkward, as the original robots.txt standard has no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "docs", and leave the one file in the level above this directory:
    --------- CUT -----------
    User-agent: *
    Disallow: /~joe/docs/
    
    --------- CUT -----------
    

    Alternatively you can explicitly list every page to be disallowed:
    --------- CUT -----------
    User-agent: *
    Disallow: /~joe/private.html
    Disallow: /~joe/foo.html
    Disallow: /~joe/bar.html
    
    --------- CUT -----------
    

    The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/" or "/tmp/", or the page "/foo.html":
    --------- CUT -----------
    # robots.txt for http://www.example.com/
    
    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space
    Disallow: /tmp/ # these will soon disappear
    Disallow: /foo.html
    
    --------- CUT -----------
    

    This example "/robots.txt" file specifies that no robots should visit any URL starting with "/cyberworld/map/", except the robot called "cybermapper":
    --------- CUT -----------
    # robots.txt for http://www.example.com/
    
    User-agent: *
    Disallow: /cyberworld/map/ # This is an infinite virtual URL space
    
    # Cybermapper knows where to go.
    User-agent: cybermapper
    Disallow:
    
    --------- CUT -----------
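
    The sketch below checks this example with Python's standard urllib.robotparser, and also swaps the two records to suggest that their order in the file does not matter to this parser. The names check, STAR_FIRST, NAMED_FIRST and "SomeOtherBot" are invented for the illustration; other robots may not behave identically.

```python
from urllib.robotparser import RobotFileParser

STAR_RECORD = "User-agent: *\nDisallow: /cyberworld/map/\n"
NAMED_RECORD = "User-agent: cybermapper\nDisallow:\n"

# The same two records in both possible orders, separated by a blank line.
STAR_FIRST = STAR_RECORD + "\n" + NAMED_RECORD
NAMED_FIRST = NAMED_RECORD + "\n" + STAR_RECORD


def check(robots_txt, agent, path):
    """Report whether Python's parser lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for text in (STAR_FIRST, NAMED_FIRST):
    print(
        check(text, "cybermapper", "/cyberworld/map/index.html"),
        check(text, "SomeOtherBot", "/cyberworld/map/index.html"),
        check(text, "SomeOtherBot", "/index.html"),
    )
```

    In both orderings cybermapper is let into /cyberworld/map/ while other robots are kept out of it, because the record that names a robot takes priority over the '*' record.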
    

    Once you are happy, save the file as "robots.txt" and move it to the root directory of your site, i.e. where your default page resides.

    NOTE: Since the Robots Exclusion Protocol is not honored by all web robot authors, it is not possible to stop all robots from wandering your site. However, the good news is that the majority of the well-known search engines and tools support this protocol. Refer to their documentation to verify this.

    Tips

    1) Use the robots.txt file if possible

    2) Use the ROBOTS META tag if you can't create the above file. It's okay to use both methods.

    3) If you know which robots you're trying to prevent from indexing your pages, a particular search engine for example, go to the source of the robot and remove your page if possible. In other words, many search engines provide ways for you to remove your URLs from their indexes without having to use any of the above methods.

    4) Make your page stand-alone if possible. Meaning, remove links to the page that you're trying to keep away from robots. The more links there are to your page, the easier it is for a search engine robot to find it. If your page is already in search engine indexes, it's too late to take this preventative step.

    5) If you must have absolute protection from robots, password protect those pages in question. Since all other methods are "agreements" that both parties must acknowledge in order for them to work in full, preventing the page from being served is the only way to guarantee that robots will not be able to touch your pages.

    Just follow these simple rules and you should have no problems. If you want to verify your robots.txt, try the checker at http://www.tardis.ed.ac.uk/~sxw/robots/check/
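
    For a quick local check before uploading, a rough sketch of a checker is given below. It only tests the rules described in this article (one path per Disallow line, no wildcards in Disallow, a User-agent line before any Disallow) and is not a substitute for a full validator. The function name lint_robots_txt and its messages are made up for this example.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for lines that break the rules above."""
    problems = []
    seen_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            # A blank line ends the current record.
            seen_agent = False
            continue
        line = raw.split("#", 1)[0].strip()  # drop any comment
        if not line:
            continue
        if ":" not in line:
            problems.append((number, "expected 'Field: value'"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            seen_agent = True
            if not value:
                problems.append((number, "User-agent needs a robot name or '*'"))
        elif field == "disallow":
            if not seen_agent:
                problems.append((number, "Disallow before any User-agent line"))
            if len(value.split()) > 1:
                problems.append((number, "only one path per Disallow line"))
            elif "*" in value:
                problems.append((number, "wildcards are not allowed in Disallow"))
            elif value and not value.startswith("/"):
                problems.append((number, "path should start with '/'"))
        else:
            problems.append((number, "field not described in this article: " + field))
    return problems


sample = "User-agent: *\nDisallow: /forms /logs\nDisallow: /tmp/*\nDisallow: /private/\n"
for number, message in lint_robots_txt(sample):
    print("line", number, "-", message)
```

    Running it on the sample flags lines 2 and 3 and accepts the rest; an empty result means none of the rules in this article were broken.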

    Let us know if this article helped you! Thanks!

    LearningLinux Webmaster




     
    All logos and trademarks in this site are property of their respective owner.
    The comments are property of their posters, all the rest © 1999-2004 by LearningLinux.