Taming the Google Monster – we have PageRank 5! Cool, I guess.

After more than 6 months of PR 0, it is nice to get a PR of 5, all of a sudden.

But why? And why such a big jump?

As a statistician/analyst/modeller, I am somewhat interested in the behaviour of the Google Page Rank system.

Not in reverse engineering the black box in order to manipulate it, which, besides being arguably against the Terms of Service, is probably a futile endeavor (think: an active, wealthy antagonist in whose interest it is to change the rules to frustrate the guys in the black hats).

Just a starting model that is reasonably consistent with my, and others’, experience.

What we think we know - about PR and incoming links

OK, we know that – ostensibly – Page Rank (and remember I am not talking here about Search Engine Result Position, just PR) is a function of the number and PR of incoming links. It is some complex function, and if you read the docs it sounds like some form of eigenvector decomposition of the entire web links matrix, which would take some time to compute. If so, it would explain why lots of page ranks are seriously out of date and, as Google suggests, “for entertainment value”.

It would also explain why my PR jumped from 0 straight to 5, if the recomputation is done every three months or so. Which actually sounds sorta dumb – surely some approximate model could be fitted.

But irrespective of the algorithm, PR is held to be some function of the number of incoming links and their “popularity” as measured by PR.

But the anecdotal evidence does not support this all that well. Get onto any SEO forum, e.g. WebMasterWorld, and you will find plenty of stories of sudden and unexplained page rank jumps (often negative), and of PR being stubbornly resistant to any increase in the quantity and quality of incoming links. There are stories, too, where an examination of the links of two sites does not bear out the basic proposition.

In my case, I had a PR of 0. There are few incoming links. Those that I have were mostly PR 0 (one was PR 4), and some of them are now PR 4. That runs counter to the notion that you have to get links from sites with higher PR than yours .. I have a 5, some now have a 4. Did they get that from me, or I from them?

And no, I am not listed on dmoz (I tried) or in any other catalog of note. You can look at my inbound links if you want with (yahoo) linkdomain:www.dsanalytics.com -site:dsanalytics.com : not very impressive.

Is PR 5, like, truly WOW!?
. . . so I must have done something right, right?

A page rank of 5 is, as I understand it, “fairly high”. It is supposedly distributed according to some sort of power law / is on a log (base 10) scale .. so, roughly speaking I should have about 10 times as many incoming links as those with PR 4.

I rather doubt it.

So what’s the theory, and where’s the evidence?

Ian Rogers’ paper The Google PageRank Algorithm and How It Works supports the notion of a logarithmic scale and also offers some insights (and code to approximate the PR eigenanalysis) about how the internal structure of the site might affect the PR of each of its pages.

Well, I have not tried it (looks a bit too simple to be true), and I have done absolutely nothing to the site structure that might make it look better to Google. Nor have I signed on to the Google Analytics project to see how Googlebot sees my site.

There is another description of the PR algorithm at Network Workbench - effectively a power method of extracting eigenvectors, which might be of interest.
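For what it is worth, here is a minimal sketch of that power-method idea in plain Python. It is not Google's code and not the Network Workbench implementation; the function name `pagerank`, the damping factor of 0.85 (the value usually quoted from the 1998 paper), the tolerance, and the four-page toy web are all assumptions for illustration. It returns raw scores that sum to 1, not the 0 to 10 toolbar number.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    All names and defaults here are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform guess

    for _ in range(max_iter):
        # Rank sitting on pages with no outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Every page passes an equal share of its rank to each page it links to.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share

        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:  # stop once the scores have settled
            break
    return rank


# Hypothetical toy web: "c" collects links from everyone, "d" gets none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

On the real web the same loop runs over billions of pages, which is one reason a full recomputation would not happen every day.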

There is a bit about the Theory of Networks at Linked: The New Science of Networks by Albert-Laszlo Barabasi which supports the log base 10 notion:

“On the web, the top websites have ten times more links than the next set, 100 times more links than the third set, and 1,000 times more links than the fourth set. Google’s Page Ranking technology is based on log distribution. A website with Google PageRank 5 (PR5) is ten times bigger than a website with PR4, 100x a PR3, 1,000X a PR2, and 10,000X a PR1 website.”

and has an interesting comment on tipping points and phase transitions.
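The base-10 reading in that quote boils down to one line of arithmetic. The sketch below just spells it out; `link_ratio` is a made-up helper name, and base 10 is the assumption taken from the quote (as far as I know Google has never published the actual base).

```python
def link_ratio(pr_a, pr_b, base=10):
    """How many times 'bigger' a PR pr_a site is than a PR pr_b site,
    ASSUMING toolbar PR is a log scale with the given base."""
    return base ** (pr_a - pr_b)


# Compare a PR 5 site with PR 1 through PR 4 sites.
ratios = {pr: link_ratio(5, pr) for pr in range(1, 5)}
print(ratios)
```

If the base were, say, 6 rather than 10, the same PR gap would imply far smaller ratios, so the conclusion is only as good as the assumed base.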

ScienceBits has done some actual analysis of 120 websites and plotted PR as a function of log(Number of Incoming Links) … which looks linear enough (if noisy), so there is some support for the notion that PR is approximately linear in the log of the number of incoming links.
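The kind of fit behind such a plot can be sketched with ordinary least squares on log10(links). The four data points below are made up purely to show the mechanics; they are NOT the ScienceBits data, and `fit_line` is just a hand-rolled helper.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up (incoming links, toolbar PR) pairs, for illustration only.
sample = [(10, 2), (100, 3), (1000, 4), (10000, 5)]

log_links = [math.log10(n_links) for n_links, _ in sample]
toolbar_pr = [pr for _, pr in sample]
slope, intercept = fit_line(log_links, toolbar_pr)
print(f"PR ~ {slope:.2f} * log10(links) + {intercept:.2f}")
```

With real, noisy data the interesting number would be the scatter around the line, which is exactly where a large “random” component would show up.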

A compatriot of mine, Dr Alex, a pretty smart fella who wrote the best Javascript editor around (imho) and who also happens to be an International Chess Master with a PhD in Computer and Information Science and an Internet entrepreneur – a polymath from my ancestral home of Adelaide, yet! – has his C Point site “rated 6 out of 10 by Google (aka PR=6) it is in the top 68,000 sites world-wide by traffic - attracting 54,000 visitors per month”.

So, that’s another data point .. PR6 equates to some small top-end fraction of the world’s sites, and so PR really maybe is logarithmic, and so my humble PR5 should mean I am in the top 680,000 sites worldwide and get 5,400 visitors per month. Well, I have no reason to doubt his data but I am still iffy about the extrapolation.

So, there is plenty of speculation out there, and some softish data.

Convinced?

I am still not convinced that the theory of PR and the limited data stack up.

Pleased that Google gave me a PR 5 (I guess: I rather doubt that it will translate into more visitors).

But not convinced that PR does not have a very large “random” component.

And perhaps a little suspicious that the Google PR algorithm, which supposedly used to be just a static measure of link popularity (computed at crawl time), is evolving towards something that includes “content quality” and “freshness”, so we no longer have such a clear-cut distinction between relevance (to a query) and static measures of “popularity”. Or maybe the static measures have become niche-ified, and PR is PR relative to a niche.

Who knows? These evolving antagonist problems are intrinsically very interesting, though.

And I guess the web dataset is the biggest game in town, bigger than Netflix – sorta hard to resist. I’d quite like to build my own semi-supervised crawler/spider, but perhaps not tonight Josephine.

In conclusion: Google tamed?

So I guess maybe I don’t have Google “tamed” .. just caught it on one of its good days, more likely.

1 Comment »

  1. Shane said,

    May 24, 2007 @ 2:48 pm

    Don’t know why your page rank is jumping around but Brin & Page’s 1998 paper is still online and a fascinating read:
    http://infolab.stanford.edu/~backrub/google.html
