Creating a Robots.txt File
By Sumantra Roy
Some people believe that they
should create different pages for different search
engines, each page optimized for one keyword and for one
search engine. Now, while I don't recommend that people
create different pages for different search engines, if
you do decide to create such pages, there is one issue
that you need to be aware of.
These pages, although optimized for different search
engines, often turn out to be pretty similar to each
other. Search engines can now detect when a site has
created such similar-looking pages, and they are penalizing
or even banning such sites. To prevent your site from being
penalized for spamming, you need to prevent each search
engine's spider from indexing pages which are not meant for
it, i.e. you need to prevent AltaVista from indexing pages
meant for Google and vice versa. The best way to do that is
to use a robots.txt file.
You should create a robots.txt file using a text editor
like Windows Notepad. Don't use your word processor to
create such a file.
Here is the basic syntax of the robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell AltaVista's spider, Scooter, not to
spider the file named myfile1.html residing in the root
directory of the server, you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Google's spider, called Googlebot, not to spider
the files myfile2.html and myfile3.html, you would write
User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put multiple User-Agent statements in
the same robots.txt file. Hence, to tell AltaVista not to
spider the file named myfile1.html, and to tell Google not
to spider the files myfile2.html and myfile3.html, you
would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
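One way to sanity-check a file like this is to feed it to a
parser and ask what each spider may fetch. The following is a
minimal sketch using Python's standard urllib.robotparser
module. The helper name load_rules is made up for this
illustration, and Python's parser is only an approximation of
how Scooter or Googlebot actually read the file.

```python
from urllib.robotparser import RobotFileParser

# The two-record robots.txt file from the example in the text.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: Googlebot
Disallow: /myfile2.html
Disallow: /myfile3.html
"""


def load_rules(text):
    """Parse robots.txt content held in a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Report, for every spider and file, whether fetching is permitted.
for agent in ("Scooter", "Googlebot"):
    for path in ("/myfile1.html", "/myfile2.html", "/myfile3.html"):
        verdict = "allowed" if rules.can_fetch(agent, path) else "blocked"
        print(f"{agent:10} {path:15} {verdict}")
```

If the records are written correctly, Scooter is reported as
blocked only for myfile1.html, and Googlebot only for
myfile2.html and myfile3.html.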
If you want to prevent all robots from spidering the file
named myfile4.html, you can use the * wildcard character
in the User-Agent line, i.e. you would write
User-Agent: *
Disallow: /myfile4.html
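However, the original robots.txt standard does not support
the wildcard character in the Disallow line. Some spiders
accept it there as an extension, but you should not rely on
it.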
Once you have created the robots.txt file, you should
upload it to the root directory of your domain. Uploading
it to any sub-directory won't work - the robots.txt file
needs to be in the root directory.
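As a small illustration of where spiders look for the file,
the sketch below derives the robots.txt address from the
address of any page on a site. The function name
robots_txt_url and the example.com address are placeholders
chosen for this example.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site hosting page_url."""
    # The leading slash makes urljoin discard the page's own path,
    # keeping only the scheme and host in front of /robots.txt.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://www.example.com/travel/index.html"))
```

Whatever sub-directory the page lives in, the result points
at the root of the domain, which is the only place the file
will be found.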
I won't discuss the syntax and structure of the robots.txt
file any further - you can get the complete specifications
from here.
Now we come to how the robots.txt file can be used to
prevent your site from being penalized for spamming in
case you are creating different pages for different search
engines. What you need to do is to prevent each search
engine from spidering pages which are not meant for it.
For simplicity, let's assume that you are targeting only
two keywords: "tourism in Australia" and
"travel to Australia". Also, let's assume that
you are targeting only three of the major search engines:
AltaVista, HotBot and Google.
Now, suppose you have used the following convention for
naming the files: each page is named by joining the
individual words of the keyword for which the page is being
optimized with hyphens, and then appending the first two
letters of the name of the search engine for which the page
is being optimized.
Hence, the files for AltaVista are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for Google are
tourism-in-australia-go.html
travel-to-australia-go.html
As I noted earlier, AltaVista's spider is called Scooter
and Google's spider is called Googlebot.
A list of spiders for the major search engines can be
found here.
Now, we know that HotBot uses Inktomi and from this list,
we find that Inktomi's spider is called Slurp.
Using this knowledge, here's what the robots.txt file
should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines in the robots.txt file, you
instruct each search engine not to spider the files meant
for the other search engines.
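With more keywords or more search engines, typing these
records by hand becomes error-prone. The sketch below builds
the same file from the naming convention described above. The
names page_name and build_robots_txt are hypothetical helpers
invented for this illustration, not part of any tool.

```python
# Keywords and engine-to-spider mapping taken from the example in the text.
KEYWORDS = ["tourism in Australia", "travel to Australia"]
SPIDERS = {"AltaVista": "Scooter", "HotBot": "Slurp", "Google": "Googlebot"}


def page_name(keyword, engine):
    """Hyphenate the keyword and append the engine's first two letters."""
    slug = "-".join(keyword.lower().split())
    return f"{slug}-{engine[:2].lower()}.html"


def build_robots_txt(keywords, spiders):
    """Write one record per spider, disallowing every other engine's pages."""
    records = []
    for engine, spider in spiders.items():
        lines = [f"User-Agent: {spider}"]
        for other in spiders:
            if other == engine:
                continue  # a spider may fetch the pages meant for it
            for keyword in keywords:
                lines.append(f"Disallow: /{page_name(keyword, other)}")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


print(build_robots_txt(KEYWORDS, SPIDERS))
```

Adding a keyword or an engine then means changing one list
entry rather than editing every record.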
When you have finished creating the robots.txt file,
double-check to ensure that you have not made any errors
anywhere in it. A small error can have disastrous
consequences - a search engine may spider files which are
not meant for it, in which case it can penalize your site
for spamming, or, it may not spider any files at all, in
which case you won't get top rankings in that search
engine.
A useful tool to check the syntax of your robots.txt file
can be found here. While it will help you correct
syntactical errors in the robots.txt file, it won't help
you correct any logical errors, for which you will still
need to go through the robots.txt file thoroughly, as
mentioned above.
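A check for logical errors can also be partly automated. The
sketch below assumes the three-engine example from this
article and uses Python's standard urllib.robotparser module
to confirm that each spider may fetch only its own pages. The
function name find_logic_errors is hypothetical, and Python's
parser may not match the behavior of every real spider.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from the three-engine example in the text.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-go.html
Disallow: /travel-to-australia-go.html

User-Agent: Googlebot
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
"""

# File-name suffix mapped to the spider the page is meant for.
SUFFIX_TO_SPIDER = {"al": "Scooter", "ho": "Slurp", "go": "Googlebot"}
KEYWORD_SLUGS = ["tourism-in-australia", "travel-to-australia"]


def find_logic_errors(robots_txt):
    """List every spider/page pair whose access differs from the plan."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    errors = []
    for suffix, owner in SUFFIX_TO_SPIDER.items():
        for slug in KEYWORD_SLUGS:
            path = f"/{slug}-{suffix}.html"
            for spider in SUFFIX_TO_SPIDER.values():
                allowed = parser.can_fetch(spider, path)
                # Only the spider the page was written for should get in.
                if allowed != (spider == owner):
                    state = "allowed" if allowed else "blocked"
                    errors.append(f"{spider} is {state} for {path}")
    return errors


for problem in find_logic_errors(ROBOTS_TXT):
    print(problem)
```

An empty result means every spider sees only the pages meant
for it. A single mistyped file name in a Disallow line would
show up as a spider being allowed to fetch a page meant for
another engine.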
About the Author
Article by Sumantra Roy. Sumantra is one of the most
respected and recognized search engine positioning
specialists on the Internet. For more articles on search
engine placement, subscribe to his 1st Search Ranking
Newsletter by going to: http://the-easy-way.com/newsletter.html
© Copyright 2005, ArticleJunction.com