6th March 2007

Checking your supplemental page count


I’ve written about this topic before, but I’ve since found an even easier way that takes no effort at all once it’s set up.

1) First off, if you haven’t already, go .

2) Now go download and install SEO for Firefox.

Now when you do a search, you can see right in your search results the number of supplemental results each domain has. For further information, just click on the supplemental link under each search result to see which pages are in the supplemental index. Also note that a toolbar PageRank of four isn’t enough to keep even Matt Cutts’s pages out of the supplemental index.

[Image: supplemental search results]


This entry was posted on Tuesday, March 6th, 2007 at 5:29 pm and is filed under Google, Webmastering. You can follow any responses to this entry through the RSS 2.0 feed. All comments are subject to my NoFollow policy. Both comments and pings are currently closed.

There are currently 10 responses to “Checking your supplemental page count”


  1. On March 6th, 2007, Halfdeck said:

    JLH, a supplemental status isn’t an either-or situation, where a page is either supplemental or in the main index. For example, run this query:

    http://www.google.com/search?q=SEO+mistakes+crappy&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

    and you’ll see that the TBPR 4 page is in the main index AND in the supplemental index. In other words, what you found in no way proves that TBPR is inaccurate to the degree that SERP might imply.

    This is also why some people, including myself, always use quotes when we say that a page “goes supplemental.” We believe that all pages of a site are in the supplemental index, but the supplemental entries are masked when a page makes it into the main index. Conversely, when Google drops a page out of the main index, its shadow in the supplemental index is revealed. That doesn’t mean the page “turned” supplemental. That record in the supplemental database was always there.

  2. On March 6th, 2007, JLH said:
    Excellent point, Halfdeck, and duly noted (by the crossing out of my assumption in the post). Not that it matters much, but you got me thinking: is supplemental really a separate database, or just a status in the same database?

    Using the query I used to find the page that I highlighted as supplemental, we see a cache of:
    http://72.14.205.104/search?q=cache:gYwvJd0xNv0J:www.mattcutts.com/blog/seo-mistakes-crappy-doorway-pages/+site:www.mattcutts.com+***+-view:adghasdtrb&hl=en&ct=clnk&cd=5&gl=us

    Using your query the cache of the page is:

    http://72.14.205.104/search?q=cache:gYwvJd0xNv0J:www.mattcutts.com/blog/seo-mistakes-crappy-doorway-pages/+SEO+mistakes+crappy&hl=en&ct=clnk&cd=1&gl=us&client=firefox-a

    Both have the same date stamp and the same cache:gYwvJd0xNv0J (whatever that is).
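
As an aside, the comparison above can be checked mechanically rather than by eye. The following is a minimal Python sketch, assuming (as the comment itself does, with a big if) that the token right after `cache:` identifies the cached copy. The helper name `cache_id` is made up for this illustration and is not part of any Google API.

```python
from urllib.parse import urlsplit, parse_qs


def cache_id(url: str) -> str:
    """Return the token that follows 'cache:' in the q parameter (hypothetical helper)."""
    q = parse_qs(urlsplit(url).query)["q"][0]
    if not q.startswith("cache:"):
        raise ValueError("q parameter is not a cache: query")
    # The q value looks like 'cache:<token>:<page url> <search terms>'.
    return q[len("cache:"):].split(":", 1)[0]


# The two cache URLs quoted in the comment above.
url_site_query = (
    "http://72.14.205.104/search?q=cache:gYwvJd0xNv0J:www.mattcutts.com"
    "/blog/seo-mistakes-crappy-doorway-pages/"
    "+site:www.mattcutts.com+***+-view:adghasdtrb&hl=en&ct=clnk&cd=5&gl=us"
)
url_keyword_query = (
    "http://72.14.205.104/search?q=cache:gYwvJd0xNv0J:www.mattcutts.com"
    "/blog/seo-mistakes-crappy-doorway-pages/"
    "+SEO+mistakes+crappy&hl=en&ct=clnk&cd=1&gl=us&client=firefox-a"
)

same_copy = cache_id(url_site_query) == cache_id(url_keyword_query)
print(cache_id(url_site_query), same_copy)
```

Matching tokens only show that both URLs point at the same cached copy; they say nothing about how many databases sit behind it.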

    So IF (and that’s a big if) the same cache ID means that it’s the same file, then that one file is represented in both the supplemental and non-supplemental indexes.

    I’d imagine that being supplemental is just a status that any page is automatically assigned, with all the fun that comes with that status, such that it’s crawled at the less frequent supplemental crawl rate and updated during the supplemental refresh. Now a page may also have regular index status, which of course supersedes the supplemental status.

    Here’s another question. Are there degrees of being purely supplemental? That is, the page is not in the regular index at all, but there are different levels of supplementalization (new word), such that some pages are crawled every 8 weeks, others every 3 months, etc.

    Coming out of supplemental would then be better defined as actually just being added to the regular index, as the page never actually leaves supplemental. Of course, if a page does leave the supplemental index then it’s pretty much gone, which is much worse than being supplemental.

    So for all of those crying to get out of the supplemental index: be careful what you wish for, you just may get it. What you should really be pining for is to get back into the regular index.

  3. On March 7th, 2007, JohnMu said:

    The supplemental and the main index are filled with duplicates: you cannot “count” the number of pages (that’s why the “about”-count is usually so far off) since the count (from the “database”) is way off all the time. If your site has a number of URLs in the index (take a shop for example), set up a Google Custom Search Engine. Adjust the filters using very fine changes: you’ll see how all the different variations of the pages show up step by step (the more you filter out of the index). In the end, I saw that the average (dynamic) page in a forum was indexed over 10x with different variations of the URL, but in the index (with the site:-query) it was shown only once. Which count is the right one?

    This is compounded by using a plugin that does things you do not know about or cannot reproduce. This is a general problem with all tools that cannot show how a result was found: how do you know it’s doing it “right” (especially when there is no obviously right way to do it)? Where is it getting its data from? And on Google: which datacenter - or does it not matter?

    These things make the supplemental index so crazy - nobody can get a real grip on them :-).

    The question is - do we really need a grip on the items in the supplemental index or should the average webmaster concentrate on getting things into the main index (regardless of whether or not they’re in the supplemental index)?

  4. On March 7th, 2007, Halfdeck said:

    Good questions John (and interesting point, Softplus).

    Here’s something Matt Cutts said that’s often quoted (and worth quoting often):

    “so when Bigdaddy didn’t select pages from a site, that would expose more supplemental results for a site.”

    Underline “expose.” He didn’t say pages “go” supplemental or they were “tagged” supplemental. He said more supplemental results would be “exposed” - as if they are usually hidden or masked.

  5. On March 7th, 2007, Aaron Pratt said:

    All but the first link in the above image for Matt Cutts are duplicates so they should be in supplemental. Maybe cuz the title had the word “crappy” in it the algorithm saw that post as low value. ;)

  6. On March 7th, 2007, JLH said:
    Excellent points by all so far. As an aside, this is why I have published my nofollow policy: the comments here have added valuable content to this page and should, at a minimum, offer links to the authors.

    Truly I stopped worrying about supplementals a while back around the bidcrappy time when the site: operator went to hell. But it’s an issue that comes up often in discussion and with your insights I’m getting a clearer view on it.

    As satisfying as it is to answer somebody’s question with, “don’t worry about pagerank just build your site for users and you’ll get links naturally” it really doesn’t put the questioner at ease. The same goes for, “My site went supplemental, what should I do?” Perhaps the right answer is “don’t worry about it; all pages are actually in the supplemental index, so work on your site and garner natural links so that more pages stay in the regular index.” But again, that’s not very satisfying to the average panicking pickle jar art salesman on the web. Having concrete examples like the one Halfdeck showed and clinical observations from John to point to goes a long way toward putting someone’s mind at ease.

    I’m not sure if this is totally accurate, but I look at the supplemental index as a prioritizing tool for Google. The entire web is growing faster than they can keep up with. By keep up I mean keeping up with a FRESH crawl of each page, and even space in the SERPs. Of course there isn’t a finite number of search queries, as any combination of words could be used, but after a while I’m sure Google has a statistical hold on what the top 90% of searches are for. Within those searches they only need to show 1000 results, so in essence there is a finite number of results available, which they are constantly working on improving. The biggest subset of the web, however, is going to be that remaining 10% (numbers just made up by me for clarity), which is pretty much an infinite number of queries and possible results.

    On the other side of things, the internet itself is growing exponentially. With CMSs like WordPress proliferating on the web, anybody can publish a 100 page site in minutes. When Larry and Sergey put this whole Google thing together in their dorm room it took some effort to publish a site; there were large obstacles to getting in the game. Now domains are $1.99 and hosting is cheaper. Sure, they add new data centers and new crawlers and have geniuses working for them who can scale the database up so that you get a search result in 0.0245 seconds, but there is a limit to their growth potential based on hardware, bandwidth, and database updates.

    Given the relatively finite number of ‘popular’ searches and the infinite amount of internet growth, coupled with limitations on crawling capacity growth, they had to come up with a solution. That solution, in my opinion, is the supplemental index. It’s a status given to all “discovered” URLs that have any value whatsoever, meaning a link at some point in time had pointed to the page. Granted, the page may not come up for searches often, may be gone, or may actually have little or no value, but it’s their duty as a good search engine to at least keep it in the index. However, they don’t have the resources to update it as often as the Washington Post’s home page, nor should they. Thus was born the crawler priority, with supplemental being the lowest priority.

    I don’t have a clue where the cut-offs are but would imagine that every url that is deemed worthy of being indexed is given a crawler priority. Some are updated daily, others weekly etc. All urls are in the supplemental crawl priority which is months rather than days or weeks.

    Now here’s where it gets interesting from my point of view. If what I said above is even remotely true, it’s a scalable solution to a point, but it requires the judgement of a computation to decide when to crawl pages. A judgement that in my mind is pretty bad at times.

    Looking at my own server logs, I am amazed at some of the pages that get crawled regularly, but equally confused by some of them that don’t. Of course, if you want to, you can play with site structure and external links to improve a page’s crawling, but is that really a productive use of one’s time?

    To this end, I offer a proposal to the Google team that I’m sure none will read, but I’m going to do it anyway because this is my blog dammit!

    1) Remove the green supplemental thing. The average searcher has no idea what the heck it means anyway, and it does nothing but infuriate the webmasters who do care. You can still have the supplemental priority and all; I just don’t see the point of the label. I think it’s Google’s way of having a disclaimer that the URL hasn’t been crawled in a while, so they can’t say if it’s going to even look like the cached copy.

    2) For a given site, I’m sure locked away somewhere in a database is a number that’s calculated based on the number of pages in the site and the crawl priority for each page. Call this crawl load. The crawl load is calculated somehow like this:

    Say you have a 100 page site. Google has given values to each of those 100 pages; fictitiously, let’s say that 10 are to be crawled weekly, 30 bi-weekly (every two weeks), 30 monthly, and the remaining 30 every 3 months. Counting a month as 4 weeks, this would give us a calculated crawl load of (10 x 4 x 3) + (30 x 2 x 3) + (30 x 3) + (30 x 1) = 420, or 420 crawls per 3 month period.

    3) Now we get into sitemaps, which brought us all together in the first place… the webmaster gets to give a hint of what they think the crawl priority should be, based on their knowledge of the site. For instance, my T&C page, the contact page, and all of the product descriptions haven’t changed in 9 months, so I’ll put those at the lowest end of the scale, say 10. However, I update my blog daily and have a news page that updates every 3 days or so, so I prioritize those higher.

    4) Google then calculates what your crawl load score is based on your recommendations. If it’s lower than theirs they use it. If it’s higher than theirs they use theirs.

    5) Let this fact be known and you’ll have webmasters all around the world scrambling to optimize their crawl priorities such that their most important pages are crawled regularly and the ones that are the most static are not. It would be a win-win: Google can concentrate on crawling new sites more with the found bandwidth, and the site owner would be happy because all of their caches would reflect what’s actually on the page.
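
The arithmetic in step 2 and the rule in step 4 can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything Google is known to do: the names `CRAWLS_PER_QUARTER`, `crawl_load`, and `effective_crawl_load` are invented here, a month is counted as 4 weeks as in the worked example, and the webmaster’s page counts are made-up numbers.

```python
# Crawls per 3 month period for each crawl frequency,
# counting a month as 4 weeks as in the worked example.
CRAWLS_PER_QUARTER = {
    "weekly": 4 * 3,
    "biweekly": 2 * 3,
    "monthly": 3,
    "quarterly": 1,
}


def crawl_load(page_counts: dict) -> int:
    """Total crawls per 3 month period for a site (hypothetical metric)."""
    return sum(CRAWLS_PER_QUARTER[freq] * pages for freq, pages in page_counts.items())


def effective_crawl_load(google_load: int, webmaster_load: int) -> int:
    """Step 4: the webmaster's figure is used only if it is lower than Google's."""
    return min(google_load, webmaster_load)


# The 100 page site from step 2.
google_view = {"weekly": 10, "biweekly": 30, "monthly": 30, "quarterly": 30}
# A made-up webmaster hint for the same 100 pages (step 3).
webmaster_hint = {"weekly": 5, "biweekly": 10, "monthly": 25, "quarterly": 60}

google_load = crawl_load(google_view)
webmaster_load = crawl_load(webmaster_hint)
print(google_load, webmaster_load, effective_crawl_load(google_load, webmaster_load))
```

With these made-up hints the webmaster’s figure is the lower one, so under the proposal it would be the one used, and the difference is the bandwidth freed up for crawling elsewhere.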

    Now this is just a dream of mine and carries no weight at all, but if you’re still reading, thanks.

  7. On March 9th, 2007, Sebastian said:

    Identical timestamps could be explained with crawler optimization. If one crawler has fetched a page it’s put in a cache where every process can get a copy for its own purposes. That’s no proof for the one database theory, and it doesn’t prove that there are two databases;)

  8. On March 9th, 2007, JLH said:
    How can we make wild-ass-guesses if you are going to start injecting logic into the equation?
  9. On March 9th, 2007, Sebastian said:

    I’m sooo sorry!

  10. On July 10th, 2007, Introducing a new SEO term: Supplemental-Only » JLH Design Blog said:

    [...] “My site went supplemental” A great discourse took place right on this blog (read the comments) that of course didn’t get a lot of airplay, but had great [...]
