It seems a day doesn’t go by in GWHG without someone being concerned that a page they blocked in their robots.txt file is showing up in Google. Google’s handling of robots.txt is quite elaborate, well documented, and easily tested. That said, many people do not fully understand the intent of robots.txt or the opportunity to use it for optimizing a web site.
No discussion of robots.txt is complete without the caveat that only GOOD robots follow it, and that it’s a very public file, so don’t expect it to keep out rogue bots or to work as a security measure for keeping stuff hidden. That being said, I’d like to talk about an obedient bot, Googlebot.
However elaborate or simple your robots.txt may be, it accomplishes one thing: it directs the crawler where it can and cannot go, either explicitly by disallowing some pages or folders, or indirectly by allowing only certain pages and blocking the rest. Stopping the crawler from crawling a page should not be confused with telling it what to do with that page. As a matter of fact, Google will index URLs that are explicitly blocked by the robots.txt file. Since it cannot crawl them, it doesn’t know what’s on the page, so the result will often be listed as URL-only, without a title or description (snippet). Sometimes, if it can find the information elsewhere, such as the ODP, it will use that to help fill in the blanks.
I don’t know exactly what threshold exists for the decision to include a URL that’s blocked by robots.txt, but I’d imagine, as with anything Google, it has something to do with the quantity and quality of links pointing to it. As anyone who’s trying to rank something in Google knows, those links are gold and not to be taken lightly. Most honest-to-goodness real links start out in someone’s browser bar. They’ve navigated to a page and found it interesting enough to tell others about it by cutting and pasting the URL into some sort of HTML somewhere. It would be a crying shame if Google were to follow that link only to be blocked by robots.txt, unable to transfer any value to the site other than listing the URL as URL-only in the search results, where it will more than likely only ever be shown for a search on the anchor text, which may be nothing more than “click here”.
Say Matt Cutts really wants to rip into me with one of his famous debunking posts. In part of his article he wants to show how often I speak of Google on this blog. To emphasize that fact he may link to an internal site search page like http://www.jlh-design.com/?s=google, which will find all the posts here that use the word Google. Being a good webmaster, I don’t want Google to return my search results in their search results, as we’ve been warned not to.
I could block all search results from being crawled in my robots.txt with something like this:
User-agent: *
Disallow: /?s=*
That will keep Google from crawling that URL. However, a link from Matt Cutts is prized and rare, so I may want to take advantage of it when it does come around.
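To make the effect of that rule concrete, here is a minimal Python sketch of wildcard path matching in the style Google documents for robots.txt (a `*` matches any run of characters, a trailing `$` anchors the end of the URL). The function name `rule_matches` and the sample blog path are my own illustrative assumptions, and this is a simplification, not Googlebot’s actual matcher.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Simplified sketch: '*' matches any run of characters, a trailing '$'
    anchors the end of the path, and otherwise the rule is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with '.*' where '*' stood.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives the prefix behaviour.
    return re.match(regex, path) is not None

# The site search URL from the example is caught by the rule...
search_blocked = rule_matches("/?s=*", "/?s=google")
# ...while an ordinary post URL (hypothetical path) is not.
post_blocked = rule_matches("/?s=*", "/2009/01/some-post/")
print(search_blocked, post_blocked)
```

Note that the rule only answers the question "may the crawler fetch this path?". It says nothing about whether the URL may appear in the index.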
The better option is to allow the URL to be crawled but stop Google from indexing it via a robots meta tag:
<meta name="robots" content="noindex,follow,noodp,noydir" />
The page that Matt linked to contains all of my site’s navigation, pointing to previous posts, the home page, categories, etc., that I’d like indexed and ranked. Allowing Google to crawl the page and follow the links while stopping the page itself from being indexed accomplishes the goal of keeping it out of the index while still passing value to the site as a whole.
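If you want to confirm that a page really carries the tag, a small script can read the directives back out of the HTML. The following is a rough sketch using Python’s standard `html.parser` module; the class name `RobotsMetaParser`, the helper `robots_directives`, and the sample HTML string are assumptions made up for illustration.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)

def robots_directives(html: str) -> set:
    """Return the set of robots meta directives present in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives

# Hypothetical search-results page carrying the tag shown above.
sample = (
    '<html><head><meta name="robots" '
    'content="noindex,follow,noodp,noydir" /></head><body></body></html>'
)
found = robots_directives(sample)
print("noindex" in found, "follow" in found)
```

A page that reports `noindex` together with `follow`, and that is not disallowed in robots.txt, is in the state described here: crawlable, kept out of the index, and still able to pass its links along.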
For a fine example of this in the wild, let’s take a renowned SEO site, SEOmoz, which has this in its robots.txt file:
User-agent: *
Disallow: /ugc/category/
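One way to see how an obedient crawler reads a rule like that is Python’s standard-library `urllib.robotparser`. This is only a sketch: the host name and the two URLs are placeholders I made up, and the standard-library parser does plain prefix matching, so it is not a model of everything Googlebot supports.

```python
from urllib.robotparser import RobotFileParser

# The rule quoted above, fed to the parser as text instead of fetched live.
rules = """\
User-agent: *
Disallow: /ugc/category/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Placeholder URLs on an example host, not real pages.
category_ok = parser.can_fetch(
    "Googlebot", "http://www.example.com/ugc/category/link-building"
)
blog_ok = parser.can_fetch("Googlebot", "http://www.example.com/blog/")
print(category_ok, blog_ok)
```

The first check comes back as not fetchable and the second as fetchable, which is all robots.txt decides: where the crawler may go.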
So remember that robots.txt doesn’t stop a page from being indexed. It does, however, stop the page from passing any value to your site, because Google can’t crawl it. Using the robots noindex meta tag will control indexing but still allow crawling, so the other links on the page can be discovered.