Search robots (also called wanderers or spiders) are programs that index web documents on the Internet.
In 1993-94 it became clear that search robots often indexed documents against the will of web-site owners. Sometimes robots interfered with ordinary users, and the same files were indexed several times. In some cases robots indexed the wrong documents: very deep virtual directories, temporary information, or CGI scripts. The Robots Exclusion Standard was designed to solve these problems.
To keep robots away from a web server, or from parts of it, the site owner creates a file that tells robots how to behave. This file must be available over HTTP at the local URL '/robots.txt'. Its format is described below.
This approach lets a robot find the rules that govern its behaviour by requesting just one file, and the file '/robots.txt' can easily be created on any existing web server.
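As a rough sketch, a crawler written in Python might request that single URL before visiting anything else on the host (the host name 'www.site.com' is just the placeholder used in the examples below):

    import urllib.error
    import urllib.request

    # Minimal sketch: request the single '/robots.txt' URL before crawling a host.
    # 'www.site.com' is the placeholder host from the examples below.
    try:
        with urllib.request.urlopen("http://www.site.com/robots.txt") as response:
            rules_text = response.read().decode("utf-8", errors="replace")  # decoded as UTF-8 for simplicity
    except urllib.error.URLError:
        rules_text = ""  # missing or unreachable file: the robot falls back to its own settings
    print(rules_text)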
The choice of this particular URL was dictated by several practical considerations.
The format and semantics of '/robots.txt' are as follows:
The file consists of one or more records separated by one or more blank lines (lines end with CR, CR/LF, or LF). Each record contains lines of the form '<field>:<optional_space><value><optional_space>'.
The <field> name is case-insensitive.
Comments may be included in the usual UNIX way: the '#' character marks the start of a comment, which runs to the end of the line.
A record starts with one or more 'User-Agent' lines, followed by one or more 'Disallow' lines (see the examples below). Unrecognized lines are ignored.
If the '/robots.txt' file is empty, does not conform to this format and semantics, or is missing, search robots act according to their own settings.
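These rules are simple enough to sketch in a few lines of Python. The function below is an illustration only (its name and the record representation are illustrative choices, not part of the standard): it splits the file into records on blank lines, strips '#' comments, treats field names case-insensitively, and ignores unrecognized lines.

    def parse_robots(text):
        """Return a list of records, each a pair (user_agents, disallow_prefixes)."""
        records, agents, disallows = [], [], []
        for raw_line in text.splitlines():
            if not raw_line.strip():                  # blank line: end of the current record
                if agents:
                    records.append((agents, disallows))
                    agents, disallows = [], []
                continue
            line = raw_line.split("#", 1)[0].strip()  # '#' starts a comment
            if not line or ":" not in line:
                continue                              # comment-only or unrecognized line
            field, value = line.split(":", 1)
            field, value = field.strip().lower(), value.strip()
            if field == "user-agent":                 # field names are case-insensitive
                agents.append(value)
            elif field == "disallow":
                disallows.append(value)
        if agents:
            records.append((agents, disallows))
        return records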
    # robots.txt for http://www.site.com
    User-Agent: *
    # this is an infinite virtual URL space
    Disallow: /cyberworld/map/
    Disallow: /tmp/ # these will soon disappear
In this example, the contents of '/cyberworld/map/' and '/tmp/' are protected from all robots.
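Continuing the illustrative parse_robots sketch above, a robot would compare the path of each URL it wants to fetch against the Disallow prefixes of the applicable record; the allowed helper below is hypothetical:

    # Builds on the parse_robots sketch above, applied to this example.
    example = """\
    # robots.txt for http://www.site.com
    User-Agent: *
    # this is an infinite virtual URL space
    Disallow: /cyberworld/map/
    Disallow: /tmp/ # these will soon disappear
    """

    agents, disallows = parse_robots(example)[0]   # single record for '*'

    def allowed(path, disallows):
        # A path is blocked if it starts with any of the listed prefixes.
        return not any(prefix and path.startswith(prefix) for prefix in disallows)

    print(allowed("/cyberworld/map/area51.html", disallows))  # False: blocked
    print(allowed("/index.html", disallows))                  # True: allowed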
    # robots.txt for http://www.site.com
    User-Agent: *
    # this is an infinite virtual URL space
    Disallow: /cyberworld/map/

    # Cybermapper knows where to go
    User-Agent: cybermapper
    Disallow:
In this example the robot 'cybermapper' is granted full access, while all other robots are denied access to the contents of '/cyberworld/map/'.
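How a robot picks the record that applies to it can be sketched the same way: a record naming the robot takes precedence over the 'User-Agent: *' record, and an empty Disallow value restricts nothing. The record_for helper below is again an illustration built on the parse_robots sketch:

    # Builds on the parse_robots sketch: pick the record for a given robot name.
    example = """\
    # robots.txt for http://www.site.com
    User-Agent: *
    # this is an infinite virtual URL space
    Disallow: /cyberworld/map/

    # Cybermapper knows where to go
    User-Agent: cybermapper
    Disallow:
    """

    def record_for(records, robot_name):
        fallback = []
        for agents, disallows in records:
            names = [a.lower() for a in agents]
            if robot_name.lower() in names:
                return disallows               # a record naming the robot wins
            if "*" in names:
                fallback = disallows           # otherwise fall back to the '*' record
        return fallback

    records = parse_robots(example)
    print(record_for(records, "cybermapper"))  # [''] -> nothing is disallowed
    print(record_for(records, "somebot"))      # ['/cyberworld/map/']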
    # robots.txt for http://www.site.com
    User-Agent: *
    Disallow: /
In this example, every search robot is denied access to the entire server.
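For reference, Python's standard library ships a parser for exactly this format, urllib.robotparser, so a real crawler does not have to implement the rules by hand (the host name again comes from the examples above):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("http://www.site.com/robots.txt")  # placeholder host from the examples above
    rp.read()                                     # fetches and parses the file
    print(rp.can_fetch("cybermapper", "http://www.site.com/cyberworld/map/"))
    print(rp.can_fetch("*", "http://www.site.com/cyberworld/map/"))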