Internet Spiders and other creepy creatures on the net
Internet spiders are programs running on a local computer; they do not really 'live' on the internet. They are simply programs dedicated to accessing sites in an attempt to find certain information, such as keywords or images for a search engine. At least, the good ones are; there are also spiders, or webbots, that harvest email addresses.
A decent spider accesses a site no more than about once every 30 seconds. There are conventions meant to keep these spiders from overloading a site, and a site can carry 'noindex' and 'robots.txt' messages to tell spiders to stay out of certain areas. The question remains whether or not the program will obey these signs.
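By way of illustration, here is a minimal sketch in Python of what a polite spider might look like. The site, pages and robot name are all hypothetical; the sketch simply checks robots.txt before fetching anything and pauses 30 seconds between requests.
import time
import urllib.request
import urllib.robotparser

SITE = "http://www.example.com"     # hypothetical site
PAGES = ["/", "/about.html"]        # hypothetical pages to visit
USER_AGENT = "ExampleSpider"        # hypothetical robot name

# Read the site's robots.txt so we know which areas are off limits.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()

for page in PAGES:
    url = SITE + page
    if not robots.can_fetch(USER_AGENT, url):
        print("robots.txt forbids", url)
        continue
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        print(url, "->", len(response.read()), "bytes")
    time.sleep(30)    # the pause that makes a spider 'decent'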
These programs are known by several names:
Web Spider
Web Bot
Web Crawler
The file 'robots.txt' should reside in your web root directory, that is, the top-level directory that a browser reaches first.
To disallow all access, this file should contain the following lines:
User-agent: *
Disallow: /
This text has been copied and pasted onto this page; you should copy it in the same way, as robots do not have spelling checkers.
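You do not have to shut spiders out completely. If you only want to keep them away from certain areas, robots.txt can list those directories instead; the directory names below are, of course, only an example.
User-agent: *
Disallow: /private/
Disallow: /images/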
An alternative, which is also wise to use, is an HTML tag in the header section of your document. If you do not want robots to harvest any links from a page, nor index the page itself, you can use this tag:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
These measures do not guarantee anything; they merely notify the harvesting program of the desired behaviour.
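The two instructions can also be mixed. For example, a page can allow indexing while asking that its links be left alone, or the other way round:
<META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">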
Email addresses that should not be seen by robots can be masked in various ways. Here on H2G2 the only possible way is to replace the '@' sign with its HTML character reference, like this:
emailname&#64;hostaddress.domain
This is no guarantee that your email address will not be harvested. However, you have made it clear that it is not your intention to supply the address to spiders.
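On pages of your own, outside H2G2, a common variation on the same trick is to write every character of the address as an HTML character reference, so that a browser still shows it normally but a simple harvester sees only a string of entities. A small Python sketch, using a made-up address:
# Turn an email address into HTML character references, e.g. 'a' becomes '&#97;'.
def mask_address(address):
    return "".join("&#%d;" % ord(character) for character in address)

# The address below is hypothetical.
print(mask_address("emailname@hostaddress.domain"))
As with robots.txt, a determined harvester can still decode this; it is a polite obstacle rather than real protection.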