Guidelines on Securing Public Web Servers
Web administrators who wish to limit bots' actions on their Web server need to create a plain
text file named robots.txt.  The file must always have this name, and it must reside in the
Web server's root document directory.  In addition, only one file is allowed per Web site.
Note that the robots.txt file is a standard that is voluntarily supported by bot programmers.
There is no requirement that it be used.  Thus, malicious bots (such as EmailSiphon and
Cherry Picker) will ignore this file.[18]
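Because the file has a fixed name and must sit in the root document directory, a compliant bot can derive its location from any page URL on the site.  The following is a minimal sketch of that derivation; the function name robots_txt_url and the host www.example.com are illustrative assumptions, not part of the guideline.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the URL where a compliant bot would look for robots.txt.

    Hypothetical helper: it keeps the scheme and host of the page URL and
    replaces the path with the fixed name /robots.txt at the site root.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://www.example.com/docs/index.html"))
```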
The robots.txt file is a simple text file that contains keywords and file specifications.  Each
line of the file is either blank or consists of a single keyword and its related information.  The
keywords tell robots which portions of a Web site are excluded.
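The line format described above can be sketched as a small parser.  This is an illustration only: the function name parse_line is hypothetical, and the sketch assumes the keyword is separated from its related information by a colon, as in the examples later in this section.

```python
def parse_line(line: str):
    """Split one robots.txt line into (keyword, value).

    Returns None for a blank line or a line without a colon.
    The keyword is lowercased so that later comparisons are simple.
    """
    line = line.strip()
    if not line:
        return None  # blank line
    keyword, sep, value = line.partition(":")
    if not sep:
        return None  # no colon: not a keyword line in this sketch
    return keyword.strip().lower(), value.strip()


if __name__ == "__main__":
    for text in ["User-agent: *", "Disallow: /images/", ""]:
        print(repr(text), "->", parse_line(text))
```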
The following keywords are allowed: 

User-agent is the name of the robot or spider.  A Web administrator may include more than
one agent name if the same exclusion is to apply to each specified bot.  The entry is not
case sensitive (in other words, "googlebot" is the same as "GOOGLEBOT" and "GoogleBot").
A "*" indicates the default record, which applies if no other match is found.  For example,
if the file contains a record for "GoogleBot" and a "*" record, the "*" record applies to
every robot other than GoogleBot.

Disallow tells the bot(s) specified in the User-agent field which sections of the Web site
are excluded.  For example, "/images" informs the bot not to open or index any files in the
images directory or any of its subdirectories.  Thus, the directory "/images/special/" would
not be indexed by the excluded bot(s).
Note that "/do" will match any directory beginning with "/do" (e.g., /do, /document, /docs),
whereas "/do/" will match only a directory named "/do/".
A Web administrator can also specify individual files.  For example, the Web administrator
could specify "/mydata/help.html" to prevent only that one file from being accessed by the
bots.
A value of just "/" indicates that nothing on the Web site is allowed to be accessed by the
specified bot(s).
At least one Disallow line must exist in each User-agent record.
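The matching rules for the two keywords can be illustrated with a short sketch.  It assumes an exact, case-insensitive comparison of agent names and a simple prefix comparison of paths, as described above; real crawlers may match differently.  The names select_record, is_excluded, and the records dictionary are hypothetical.

```python
def select_record(records, bot_name):
    """Pick the Disallow list for a bot.

    records maps an agent name (or "*") to a list of Disallow values.
    Agent names are compared without regard to case; "*" is the default
    record used when no named record matches.
    """
    wanted = bot_name.lower()
    for agent, rules in records.items():
        if agent != "*" and agent.lower() == wanted:
            return rules
    return records.get("*", [])


def is_excluded(path, disallow_rules):
    """Return True if the path begins with any Disallow value.

    Empty values are skipped in this sketch (an assumption; the text
    above does not discuss them).
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    records = {"GoogleBot": ["/"], "*": ["/do"]}
    print(select_record(records, "googlebot"))        # the GoogleBot record
    print(select_record(records, "OtherBot"))         # the default record
    print(is_excluded("/document/a.html", ["/do"]))   # prefix match
    print(is_excluded("/document/a.html", ["/do/"]))  # no match
```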
There are many ways to use the robots.txt file.  Some simple examples are as follows: 
    
To disallow all (compliant) bots from specific directories: 
User-agent: *
Disallow: /images/ 
Disallow: /banners/ 
Disallow: /Forms/ 
Disallow: /Dictionary/ 
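One way to check how a compliant bot would interpret these rules is to feed them to the robots.txt parser in the Python standard library (urllib.robotparser).  This is a sketch for illustration; the bot name ExampleBot and the host www.example.com are assumptions, not part of the guideline.

```python
from urllib.robotparser import RobotFileParser

# The example record from the text, using the hyphenated keyword form.
RULES = """\
User-agent: *
Disallow: /images/
Disallow: /banners/
Disallow: /Forms/
Disallow: /Dictionary/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# A file under an excluded directory should be refused.
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/images/logo.gif")
# A file outside the excluded directories should be permitted.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")

print("images/logo.gif fetchable:", blocked)
print("index.html fetchable:", allowed)
```

Such a check confirms only what a compliant bot would do; as noted earlier, a malicious bot is free to ignore the file.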
                                                   
[18] Other methods for controlling malicious bots exist; however, they change constantly as malicious bot
operators and Web administrators develop new methods of counteracting each other's techniques.  Given the
constantly changing nature of this area, discussion of these techniques is beyond the scope of this document.