
1st August 2008

Don’t use Robots.txt to control indexing

posted in SEO, Webmastering

It seems a day doesn’t go by in the Google Webmaster Help Group (GWHG) without someone being concerned that a page they blocked in their robots.txt file is showing up in Google. Google’s handling of robots.txt is quite elaborate, well documented, and easily tested. Having said all of that, many do not fully understand the intent of robots.txt or the opportunity it presents for optimizing a web site.

Any discussion of robots.txt is incomplete without the caveat that only GOOD robots follow it, and it’s a very public file, so don’t expect it to keep out rogue bots or to serve as a security measure for hiding things. That being said, I’d like to talk about an obedient bot: Googlebot.

As elaborate or simple as your robots.txt may be, it accomplishes one thing: it directs the crawler where it can and cannot go, either explicitly by disallowing some pages/folders or indirectly by only allowing certain pages and blocking the rest. Stopping the crawler from crawling a page should not be confused with giving it direction on what to do with that page. As a matter of fact, Google will indeed index URLs that are explicitly blocked by the robots.txt file. Since they cannot crawl them, they really don’t know what’s on the page, so the URL will often be listed as URL only, without a title or description (snippet). Sometimes, if they can find the information elsewhere, like the ODP, they’ll use that to help fill in the blanks.

I don’t know exactly what threshold exists for the decision to include a URL that’s blocked by robots.txt, but I’d imagine, as with anything Google, it has something to do with the quantity and quality of links pointing to it. That being said, and as anyone who’s tried to rank something in Google knows, those links are gold and not to be taken lightly. Most honest-to-goodness real links start out in someone’s browser bar. They’ve navigated to a page and found it interesting enough to tell others about it by cutting-n-pasting the URL into some sort of HTML somewhere. It would be a crying shame if Google were to follow that link only to be blocked by robots.txt, unable to transfer any value to the site other than listing the URL as URL-only in the search results, which will more than likely only ever be shown for a search on the anchor text, which may be nothing more than “click here”.

Say Matt Cutts really wants to rip into me with one of his famous debunking posts. In part of his article he wants to show how often I speak of Google on this blog. To emphasize that fact he may link to an internal site-search page like http://www.jlh-design.com/?s=google, which will find all the posts here that use the word Google. Being a good webmaster, I don’t want Google to return my search results in their search results, as we’ve been warned not to.

I could block all search results from being crawled in my robots.txt with something like this:

User-agent: *
Disallow: /?s=*

This will keep Google from crawling that URL. However, a link from Matt Cutts is prized and rare, so I may want to take advantage of it when it does come around.
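If you want to sanity-check a rule like this before deploying it, Python’s standard-library urllib.robotparser can test URLs against a rule set. One caveat worth flagging: robotparser does plain prefix matching and does not understand Google’s “*” wildcard extension, so this sketch writes the rule as a bare prefix (/?s=) rather than /?s=* — for Googlebot the two are equivalent, since Google treats a disallow rule as a prefix anyway.

```python
from urllib.robotparser import RobotFileParser

# Sketch: test robots.txt rules locally before deploying them.
# Note: urllib.robotparser does literal prefix matching and does NOT
# understand Google's "*" wildcard extension, so the rule is written
# as a bare prefix here instead of "/?s=*".
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /?s=",   # block internal site-search results
])

# The internal-search URL is blocked from crawling...
print(rp.can_fetch("Googlebot", "http://www.jlh-design.com/?s=google"))  # False

# ...while ordinary pages remain crawlable.
print(rp.can_fetch("Googlebot", "http://www.jlh-design.com/2008/08/"))   # True
```

Remember the distinction from above: “blocked from crawling” here says nothing about whether the URL can still appear in the index.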

The better option is to allow the URL to be crawled but stop Google from indexing it via a robots meta tag.

<meta name="robots" content="noindex,follow,noodp,noydir" />

The page that Matt linked to does contain all of my site’s navigation pointing to previous posts, the home page, categories etc, that I’d like indexed and ranked. Allowing Google to crawl the page and follow the links while stopping it from being indexed accomplishes the goal of keeping it out of the index but passing value to the site as a whole.
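To verify a page is actually serving that tag, a short standard-library check will do. This is an illustrative sketch using Python’s html.parser; the class name is my own, not part of any library.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives += [d.strip().lower()
                                for d in a.get("content", "").split(",")]

page = ('<html><head>'
        '<meta name="robots" content="noindex,follow,noodp,noydir" />'
        '</head><body>...</body></html>')

p = RobotsMetaParser()
p.feed(page)
print(p.directives)  # ['noindex', 'follow', 'noodp', 'noydir']
```

Seeing “noindex” alongside “follow” in that list is exactly the combination this post argues for: stay out of the index, but let the links pass value.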

For a fine example of this in the wild, let’s take a renowned SEO site, SEOmoz, which has this in its robots.txt file:

User-agent: *
Disallow: /ugc/category/

Yet Google currently has 28 URL-only pages indexed from that blocked folder. (screenshot)

So remember: robots.txt doesn’t stop a page from being indexed. It does, however, stop the page from passing any value to your site, since the crawler can’t crawl it. Using the robots noindex meta tag will control indexing while still allowing crawling for discovery of the other links on the page.


There are currently 2 responses to “Don’t use Robots.txt to control indexing”


  1. On August 4th, 2008, Everett said:

    That’s all fine until you start dealing with eCommerce sites that have a ton of duplicate content created from dynamic functions that are out of your control, for which the “system” is unable or unwilling to allow dynamic insertion of a noindex meta tag (i.e. if you add one to the dynamic variable page ?sort=bestsellers it also adds it to the unsorted page). In those cases, you have to block the content from being indexed by whatever means you have at your disposal, namely robots.txt with wildcard disallows. Ideally, one would have enough control over their own site, or their clients’ sites, to do a dynamic insertion, but with big eCommerce solutions this is not often the case.
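    The situation Everett describes, where the platform can’t vary the meta tag per URL, is exactly when robots.txt wildcards earn their keep. Where you do have a template hook, though, the conditional tag can be a few lines. A hypothetical sketch — the parameter names and helper are my own illustration, not taken from any particular platform:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical: query parameters that merely re-sort or re-filter an
# existing listing, producing duplicate content. Names are assumptions.
DUPLICATING_PARAMS = {"sort", "order", "view"}

def robots_meta_for(url):
    """Return a noindex tag for parameterized duplicate views, and
    nothing for the canonical page (whose default is index,follow)."""
    params = parse_qs(urlparse(url).query)
    if DUPLICATING_PARAMS & params.keys():
        return '<meta name="robots" content="noindex,follow" />'
    return ""

print(robots_meta_for("http://example.com/widgets?sort=bestsellers"))
# <meta name="robots" content="noindex,follow" />
print(robots_meta_for("http://example.com/widgets"))  # empty: canonical page
```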

  2. On August 19th, 2008, Patrick Daly said:

    Great post. Often a one-liner in robots.txt seems the easiest and quickest solution, but you convinced me otherwise.
