A robots.txt file is a plain-text file placed on your server that tells search engine spiders which sections or pages of your website they should not crawl. You can use it to keep spiders out of your website entirely, to keep them out of certain areas, or to issue individual instructions to specific spiders.
The file itself is a simple text file, which can be created in Notepad. It needs to be saved to the root directory of your website.
Is a robots.txt file needed?
Yes. Absolutely. No question.
Why is a robots.txt file important?
To prevent spiders from accessing certain files on your website, such as images, CSS and JavaScript files.
To point spiders to your XML sitemap with the Sitemap directive (Google, Yahoo and Microsoft support a single format for submitting XML sitemaps).
A few minor spiders may take the absence of a robots.txt file as a universal “disallow” and not index your website at all.
A missing robots.txt file could be interpreted as a sign of amateurism.
To disallow spiders that are abusive, use too much of your bandwidth or are just a nuisance overall.
If you have two versions of a page (one for viewing in the browser and one for printing), you will want the printing version excluded from crawling; otherwise you risk a duplicate content penalty.
And if all of that weren’t enough, Google’s official Webmaster Guidelines explicitly recommend their use.
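As a rough illustration of the Sitemap directive mentioned above, the sketch below parses a small robots.txt with Python's standard urllib.robotparser module. The example.com URLs and the /print/ directory are placeholders, not taken from any real website, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the sitemap URL and /print/ are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in-memory text, no network access

sitemaps = parser.site_maps()  # list of Sitemap URLs, or None if there are none
print(sitemaps)

# The printer-friendly copies are excluded for every spider.
print_allowed = parser.can_fetch("*", "http://www.example.com/print/page.html")
print(print_allowed)
```

The Sitemap line is independent of the User-agent records, so it can sit anywhere in the file.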
Step-by-step directions for creating your own robots.txt file:
“User-agent:” names the search engine crawler that a record applies to, and “Disallow:” lists the files and directories to be excluded from crawling. In addition to “User-agent:” and “Disallow:” entries, you can include comment lines: just put the # sign at the beginning of the line:
# All user agents are disallowed to see the /temp directory.
User-agent: *
Disallow: /temp/
Exclude a file from an individual spider/search engine:
You have a file, addresses.html, in a directory called ‘emaillogs’ that you do not wish to be indexed by Google. You know that the spider that Google sends out is called ‘Googlebot’. You would add these lines to your robots.txt file:
User-Agent: Googlebot
Disallow: /emaillogs/addresses.html
Exclude a section of your website from all spiders:
You are building a new section of your website in a directory called ‘new’ and do not wish it to be indexed before you are finished. In this case you do not need to name each robot that you wish to exclude; you can simply use the wildcard character, ‘*’, to exclude them all.
User-Agent: *
Disallow: /new/
Allow all spiders to index everything:
Once again you can use the wildcard, ‘*’, to let all spiders know they are welcome. Just leave the Disallow line empty.
User-agent: *
Disallow:
Allow no spiders to index any part of your website. This requires just a tiny change from the command above – be careful!
User-agent: *
Disallow: /
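If you want to check rules like the ones above before uploading them, one option is Python's standard urllib.robotparser module. The sketch below is only an approximation of how a real spider behaves. “SomeOtherBot” and the page paths are made-up names for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# The examples from the text, as in-memory strings.
googlebot_only = build_parser(
    "User-Agent: Googlebot\nDisallow: /emaillogs/addresses.html\n"
)
block_new = build_parser("User-Agent: *\nDisallow: /new/\n")
allow_all = build_parser("User-agent: *\nDisallow:\n")
block_all = build_parser("User-agent: *\nDisallow: /\n")

# Only Googlebot is kept away from addresses.html.
print(googlebot_only.can_fetch("Googlebot", "/emaillogs/addresses.html"))     # False
print(googlebot_only.can_fetch("SomeOtherBot", "/emaillogs/addresses.html"))  # True

# Every spider is kept out of /new/, but not out of the rest of the site.
print(block_new.can_fetch("Googlebot", "/new/draft.html"))  # False
print(block_new.can_fetch("Googlebot", "/index.html"))      # True

# An empty Disallow allows everything; a single slash blocks everything.
print(allow_all.can_fetch("Googlebot", "/index.html"))  # True
print(block_all.can_fetch("Googlebot", "/index.html"))  # False
```

The last two lines show how small the difference is between welcoming every spider and shutting them all out.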
What can go wrong:
SEOBook recently posted an article about a mistake they made. While changing a robots.txt file, they accidentally blocked one of the most linked-to pages on the website. Once Google could no longer crawl and index the URL, the website’s search traffic dropped by half. They estimated that the one line of code cost them about $10,000 in profit. Here’s what happened:
Disallow: /*page
also blocks a file like this from being indexed in Google:
beauty-pageants.php
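To see why that one line caught the extra file, here is a minimal sketch of wildcard matching. It assumes a simplified model in which ‘*’ matches any run of characters, a trailing ‘$’ anchors the end of the URL, and everything else must match from the start of the path. The function name rule_matches and the second and third test paths are invented for this example; real spiders may differ in the details.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a Disallow pattern matches the URL path.

    Simplified model: '*' matches any run of characters, a trailing '$'
    anchors the end, and the rest is a literal match from the start.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


print(rule_matches("/*page", "/beauty-pageants.php"))  # True: "page" is inside "pageants"
print(rule_matches("/*page", "/page-two.html"))        # True
print(rule_matches("/*page", "/about.html"))           # False
```

Running a list of your most important URLs through a check like this before publishing a new rule is a cheap way to catch this kind of accident.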
Tips for creating your own robots.txt file:
You can use Notepad (some people may tell you this won’t work; don’t listen to them). Don’t use WordPad, which may save the file in a rich-text format; the file must be plain text.
Look at other websites and understand their robots.txt files to get started.
You don’t need to list every single directory in your robots.txt file. Certain directories may contain confidential information that is not linked from anywhere on the website and are therefore not available to spiders anyway. Calling attention to them in your robots.txt file may help hackers find files that you don’t want made public. But, just in case you want to try this anyway, here are some tips:
If the directory you want to exclude or block is “awesomesauce” all you need to do is abbreviate it and add an asterisk to the end. You’ll want to make sure that the abbreviation is unique. You can identify the directory you want protected ‘/awesomesauce/’ and then add this line to your robots.txt file:
User-agent: *
Disallow: /awe*
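This command will disallow spiders from crawling directories that begin with “awe”. You’ll want to double-check your directory structure to make sure you won’t be disallowing any other directories that you wouldn’t want disallowed. For example, this directive would also disallow the directory “awebs” if you had that directory on your server. Note that the asterisk wildcard is an extension to the original robots.txt convention: the major search engines understand it, but a spider that does not may treat the asterisk as a literal character, in which case the rule will not match the directory at all.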
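The directory check described above can be sketched in a few lines. This assumes the rule “/awe*” behaves like the plain prefix “/awe” for spiders that support wildcards. The directory list is hypothetical; substitute the top-level directories from your own server.

```python
def blocked_directories(prefix: str, directories: list[str]) -> list[str]:
    """Return the directories whose path starts with the given prefix."""
    return [d for d in directories if d.startswith(prefix)]


# Hypothetical top-level directories; replace with your own structure.
site_directories = ["/awesomesauce/", "/awebs/", "/about/", "/images/"]

caught = blocked_directories("/awe", site_directories)
print(caught)  # ['/awesomesauce/', '/awebs/']
```

If the list contains anything besides the directory you meant to hide, choose a longer abbreviation.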
If you are using doorway pages (similar pages, each optimized for an individual search engine) you may wish to ensure that individual robots do not have access to all of them. This is important in order to avoid being penalized for spamming a search engine with a series of overly similar pages, otherwise known as duplicate content.
It is important to clarify that robots.txt is not a way of preventing search engines from crawling your website (i.e. it is not a firewall or a kind of password protection). Putting up a robots.txt file is something like putting a note saying “Please, do not enter” on an unlocked door: you cannot prevent thieves from coming in, but the good guys will not open the door and enter. That is why we say that if you have really sensitive data, it is naïve to rely on robots.txt to protect it from being indexed and displayed in search results.
The location of robots.txt is very important. It must be in the main directory because otherwise user agents (search engines) will not be able to find it – they do not search the whole website for the robots.txt file. Instead, they look first in the main directory (i.e. http://www.websitemagazine.com/robots.txt) and if they don’t find it there, they simply assume that this website does not have a robots.txt file and therefore they index everything they find along the way. So, if you don’t put robots.txt in the right place, do not be surprised that search engines index your whole website.
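To make the location rule concrete, the sketch below works out where a spider would look for robots.txt, starting from any page address on the same host. The function name robots_url and the page path are invented for the example; only the host name comes from the text above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path here is a made-up example.
location = robots_url("http://www.websitemagazine.com/content/blogs/post.aspx?id=1")
print(location)  # http://www.websitemagazine.com/robots.txt
```

Whatever page the spider starts from, the answer is always the same file in the main directory.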
The more serious problem is with logical errors. For instance:
User-agent: *
Disallow: /temp/
User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
The above example is from a robots.txt file that allows all spiders to access everything on the website except the /temp directory. Up to there it is fine, but after that there is another record that specifies more restrictive terms for the Googlebot spider. It is tempting to assume that a spider reads the file from top to bottom and stops at the first record that applies to it, or that it combines every record that matches. Googlebot does neither: it obeys only the record that names it most specifically and ignores the “User-agent: *” record entirely. Here the Googlebot record repeats /temp/, so Googlebot stays out of /images/, /temp/ and /cgi-bin/ as intended. Had /temp/ been left out of the Googlebot record, Googlebot would have been free to crawl /temp/. A less sophisticated spider may simply stop at the first record that matches, so it is safest to repeat shared rules in every specific record and not to rely on the order of the records. You see, the structure of a robots.txt file is simple, but serious mistakes can still be made easily.
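You can experiment with record order using Python's standard urllib.robotparser module. Treat the sketch below as one parser's reading of the file, not as proof of what any particular search engine does. “OtherBot” and the file names are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from the text, with a blank line between the two records.
ROBOTS_TXT = """\
User-agent: *
Disallow: /temp/

User-agent: Googlebot
Disallow: /images/
Disallow: /temp/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# This parser applies the record that names the spider, wherever it appears.
print(parser.can_fetch("Googlebot", "/images/logo.png"))  # False
print(parser.can_fetch("Googlebot", "/temp/cache.html"))  # False

# A spider without its own record falls back to the "*" record.
print(parser.can_fetch("OtherBot", "/images/logo.png"))   # True
print(parser.can_fetch("OtherBot", "/temp/cache.html"))   # False
```

Try removing “Disallow: /temp/” from the Googlebot record and checking /temp/ again to see what happens when a shared rule is not repeated.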
While Google won’t index or crawl the content of pages blocked by robots.txt, they may still index the URLs if they find them on other pages on the Internet. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the website, or the title from the Open Directory Project, can appear in Google search results.
In order to use a robots.txt file, you’ll need to have access to the root of your domain (if you’re not sure, check with your web hosting company). If you don’t have access to the root of a domain, you can restrict access using the robots meta tag. To entirely prevent a page’s contents from being listed in the Google web index even if other sites link to it, use a noindex meta tag. As long as the Googlebot spider reaches the page, it will see the noindex meta tag and prevent that page from showing up in their index.
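The noindex meta tag goes in the head of the page itself. The sketch below shows what the tag looks like inside a sample page and uses Python's standard html.parser module to confirm that it is present. The class name, the function name and the sample page are all invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # The content attribute may hold several comma-separated values.
            for item in attr_map.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag with noindex."""
    meta_parser = RobotsMetaParser()
    meta_parser.feed(html)
    return "noindex" in meta_parser.directives


# A made-up sample page with the tag in its head section.
PAGE = """<html><head>
<meta name="robots" content="noindex">
<title>Printer-friendly version</title>
</head><body>...</body></html>"""

print(has_noindex(PAGE))  # True
```

Remember that a spider can only see this tag if it is allowed to fetch the page, so do not block the same page in robots.txt.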
Inside the webmaster console Google will also show you what pages are currently blocked by your robots.txt file, and let you view when Google tried to crawl the page and noticed it was blocked. Google also shows you what pages are 404 errors, which might be a good way to see if you have any internal broken links or external links pointing at pages that no longer exist.
Thu, Jan 13, 2011
Development, Internet Marketing, SEO