How to Block Spiders From Visiting and Indexing Your Site:
Robots.txt File Tutorial
There are reasons you might not want your Web site indexed by
search engines. More likely, there are just certain pages that
you want to keep out of the major search engines.
For instance, maybe you constructed an elaborate direct marketing
site that requires the visitor to enter through your main page and
then proceed through a highly structured series of links that lead
them to a buying decision. Visitors who landed directly on an internal
page would only be confused, and they would be less likely to buy a
product or service.
Whatever your reason, there is a standard that you can implement
that will keep most of the major search engine spiders from
indexing your Web site.
Here's how to block the spiders. Create a file called "robots.txt"
that includes the following code:
User-agent: *
Disallow: /
The first line specifies the agents, browsers or spiders that should
read this file and adhere to the instructions in the following lines
of code; the asterisk means "all of them." The second line stipulates
which files or directories the spider or browser should not read or
index. The example above uses "/", which means the agent should not
read or index anything, because the path of every page on the site
begins with a slash. Some spiders also accept "/*", treating the
asterisk as a wildcard for "everything," but the plain slash is the
original and most widely understood form.
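
One way to sanity-check a blanket rule before uploading it is Python's standard urllib.robotparser module. The sketch below is only an illustration, not a guarantee of how any particular search engine behaves. It uses the bare "Disallow: /" prefix form, because this parser matches path prefixes literally and does not expand asterisks in paths. The helper name is_allowed, the spider name ExampleBot, and the yourcompany.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Rules that ask every spider to stay out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given rules let `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "ExampleBot" and these URLs are hypothetical.
    for url in (
        "http://www.yourcompany.com/",
        "http://www.yourcompany.com/products/index.htm",
    ):
        print(url, is_allowed(ROBOTS_TXT, "ExampleBot", url))
```

Both URLs should come back as not allowed, since every path begins with the disallowed "/" prefix.
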
The robots.txt file must be placed in the root directory of your
Web site. What this means is that if you are hosting your Web site
using one of the free services and your domain looks something like
this:
http://members.aol.com/Joesmith/home.htm
you cannot use the robots.txt file to keep out the spiders, since you
don't have a primary domain name. The primary domain name is aol.com,
and America Online will probably not allow you to block all the search
engine spiders from indexing their site and the Web sites of the
11 million other subscribers.
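
To see why the root directory matters, it helps to work out where a spider will look. Spiders request robots.txt from the top of the host, regardless of which page they were heading for. The following sketch mimics that with Python's urllib.parse.urljoin; the helper name robots_url is hypothetical.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the address where a spider would look for robots.txt."""
    # The leading slash makes urljoin discard the page's own path,
    # keeping only the scheme and host name.
    return urljoin(page_url, "/robots.txt")


if __name__ == "__main__":
    print(robots_url("http://members.aol.com/Joesmith/home.htm"))
    # prints http://members.aol.com/robots.txt
```

The Joesmith directory never appears in the result, so a robots.txt file placed inside it would not be consulted.
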
This robots.txt file could look like this if there were specific
directories and files that you wish the search engines not to index:
User-agent: *
Disallow: /clients/
Disallow: /products/
Disallow: /pressrelations/
Disallow: /surveys/survey.htm
In the above example the robots.txt file asks the search engine
spiders to omit all pages within the following directories:
http://www.yourcompany.com/clients/
http://www.yourcompany.com/products/
http://www.yourcompany.com/pressrelations/
And the following specific page:
http://www.yourcompany.com/surveys/survey.htm
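
A set of rules like this can be tested against a few sample addresses before it goes live. The sketch below again assumes Python's urllib.robotparser, which matches path prefixes literally, so the rules are written as plain prefixes with no asterisks. The helper name blocked_urls, the spider name ExampleBot, and the pages index.htm, list.htm and results.htm are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Directory and page rules written as plain path prefixes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /clients/
Disallow: /products/
Disallow: /pressrelations/
Disallow: /surveys/survey.htm
"""


def blocked_urls(robots_txt: str, agent: str, urls: list[str]) -> list[str]:
    """Return the subset of `urls` that the rules ask `agent` to skip."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    base = "http://www.yourcompany.com"
    samples = [
        base + "/index.htm",            # not listed, so still indexable
        base + "/clients/list.htm",     # inside a blocked directory
        base + "/surveys/survey.htm",   # the one blocked page
        base + "/surveys/results.htm",  # same directory, not blocked
    ]
    for url in blocked_urls(ROBOTS_TXT, "ExampleBot", samples):
        print("blocked:", url)
```

Note that only survey.htm is blocked in the surveys directory; other pages there remain open to the spiders.
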
If you are one of the millions of people hosting a Web site on America
Online's server or one of the other free or subdirectory Web site
services and you can't place a robots.txt file in their root directory,
you can use a META tag that talks to some of the spiders:
<META NAME="ROBOTS" CONTENT="NOINDEX">
You will need this META tag on every page in your Web site that
you don't want indexed. If your Web site has 30 or 40 pages (or
more), this will take a lot of time. Here's another reason to buy
a good HTML editor like Luckman's WebEdit or Allaire's HomeSite.
These programs allow you to do a global search and replace and add
an HTML tag to every Web page that you open in the program. As with
all META tags, this META tag goes at the top of your HTML document
between the <HEAD> and </HEAD> tags.
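
If you would rather script the change than use an editor's search and replace, the same idea can be sketched in a few lines of Python. This is a rough illustration under simple assumptions: each page has an ordinary <HEAD> tag, and a regular expression is good enough to find it. The function name add_noindex is hypothetical.

```python
import re

NOINDEX_TAG = '<META NAME="ROBOTS" CONTENT="NOINDEX">'

# Opening <HEAD> tag in any letter case, with optional attributes.
HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
# An existing robots META tag, so a page is never tagged twice.
ROBOTS_META = re.compile(
    r"<meta\s[^>]*name\s*=\s*[\"']?\s*robots", re.IGNORECASE
)


def add_noindex(html: str) -> str:
    """Insert the NOINDEX tag right after <HEAD>, unless one is present."""
    if ROBOTS_META.search(html):
        return html  # already has a robots META tag
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <HEAD> section; leave the page untouched
    end = match.end()
    return html[:end] + "\n" + NOINDEX_TAG + html[end:]


if __name__ == "__main__":
    page = "<HTML><HEAD><TITLE>Survey</TITLE></HEAD><BODY></BODY></HTML>"
    print(add_noindex(page))
```

To cover a whole site, you could read each .htm file, pass its text through add_noindex, and write the result back. Work on a backup copy first, since a regular expression can be fooled by unusual markup.
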