Guidelines on Securing Public Web Servers
Web administrators who wish to limit bots' actions on their Web server need to create a plain
text file named robots.txt. The file must always have this name, and it must reside in the
Web server's root document directory. In addition, only one file is allowed per Web site.
Note that robots.txt is a voluntary standard: compliant bots honor it, but nothing requires
a bot programmer to support it. Thus, malicious bots (such as EmailSiphon and
Cherry Picker) will ignore this file. [18]
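The location rule described above (one file, always named robots.txt, always in the root document directory) means a bot can derive the file's URL from any page URL on the site. The following is a minimal sketch of that derivation; the function name robots_txt_url and the example.com addresses are illustrative assumptions, not part of the guideline.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper: keeps the scheme and host of the page and
    replaces the path with /robots.txt at the document root.
    """
    parts = urlsplit(page_url)
    # Query string and fragment are dropped; only the root-level file matters.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/docs/help.html"))
# prints: http://www.example.com/robots.txt
```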
The robots.txt file is a simple text file that contains keywords and file specifications. Each
line of the file is either blank or consists of a single keyword and its related information. The
keywords tell robots which portions of a Web site are excluded.
The following keywords are allowed:
User-agent is the name of the robot or spider. A Web administrator may also
include more than one agent name if the same exclusion is to apply to each specified
bot. The entry is not case sensitive (in other words, googlebot is the same as
GOOGLEBOT and GoogleBot).
A * indicates the default record, which applies if no other match is found. For
example, if the only named record is "GoogleBot", then the "*" record applies to every other robot.
Disallow tells the bot(s) specified in the User-agent field which sections of the Web
site are excluded. For example, /images informs the bot not to open or index any files
in the images directory or any subdirectories. Thus, the directory "/images/special/"
would not be indexed by the excluded bot(s).
Note that /do will match any directory beginning with "/do" (e.g. /do, /document, /docs,
etc.), whereas /do/ will match only a directory named "/do/".
A Web administrator can also specify individual files. For example, the Web
administrator could specify /mydata/help.html to prevent only that one file from being
accessed by the bots.
A value of just / indicates that nothing on the Web site is allowed to be accessed by the
specified bot(s).
Each User-agent record must contain at least one Disallow line.
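The two keyword rules above can be sketched in a few lines of Python: record selection ignores case and falls back to the "*" record, and a Disallow value excludes every path that begins with it. This is a simplified illustration under stated assumptions, not a full parser. The function names, the dictionary layout, and the sample records are hypothetical, and skipping an empty Disallow value is a choice made for this sketch that the text above does not discuss.

```python
def select_rules(records, bot_name):
    """Return the Disallow list that applies to bot_name.

    records maps a User-agent name (or "*") to a list of Disallow values.
    Names are compared without regard to case; "*" is the default record.
    """
    wanted = bot_name.lower()
    for agent, rules in records.items():
        if agent != "*" and agent.lower() == wanted:
            return rules
    return records.get("*", [])


def is_excluded(path, disallow_rules):
    """Return True if path begins with any Disallow value.

    Assumption for this sketch: an empty value excludes nothing.
    """
    return any(path.startswith(rule) for rule in disallow_rules if rule)


# Hypothetical records used only to exercise the rules described above.
records = {
    "GoogleBot": ["/do"],
    "*": ["/do/", "/mydata/help.html"],
}

google_rules = select_rules(records, "googlebot")   # case-insensitive match
other_rules = select_rules(records, "SomeOtherBot")  # falls back to "*"

print(is_excluded("/document/a.html", google_rules))  # True: begins with /do
print(is_excluded("/document/a.html", other_rules))   # False: not under /do/
print(is_excluded("/mydata/help.html", other_rules))  # True: single file
print(is_excluded("/anything.html", ["/"]))           # True: whole site
```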
There are many ways to use the robots.txt file. Some simple examples are as follows:
To disallow all (compliant) bots from specific directories:
User-agent: *
Disallow: /images/
Disallow: /banners/
Disallow: /Forms/
Disallow: /Dictionary/
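One way to check how a compliant bot would read the example above is to feed it to the robots.txt parser in the Python standard library. The sketch below uses the hyphenated field name User-agent, which is the spelling that parser recognizes. The bot name ExampleBot, the helper allowed, and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example record from the text, with the hyphenated User-agent field.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images/
Disallow: /banners/
Disallow: /Forms/
Disallow: /Dictionary/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, bot="ExampleBot"):
    """Return True if the parsed rules let the named bot fetch url."""
    return parser.can_fetch(bot, url)


print(allowed("http://www.example.com/images/logo.gif"))  # False
print(allowed("http://www.example.com/index.html"))       # True
```

Because the rules live under the "*" record, the result is the same for any bot name passed to allowed. This only models compliant bots; a bot that ignores robots.txt is unaffected.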
[18] Other methods for controlling malicious bots exist; however, they are changing constantly as the malicious bot
operators and Web administrators develop new methods of counteracting each other's techniques. Given the
constantly changing nature of this area, discussion of these techniques is beyond the scope of this document.