Honest Search Engine Optimization
I blogged not so long ago on a similar topic, Is Page Rank Any Use?, and a while back on Website Optimization and Data Analytics. I have had a couple more data points, and a few thoughts, since then.
At the time of the last post, our PR had gone down from 5 to 3. It went back up to 5 and is, now, down to 4.
All without me doing a thing to the site, except that I added a few articles. Perhaps some of the incoming links have had THEIR PR changed; since I cannot control or influence that, I won’t waste any more time on the PR “concept”.
If PR is dead, what about SERPs?
I can still easily find any of my keyphrases on Google (refer to the “keyphrase extraction” example in that post), but these usually have a much worse SERP [Search Engine Result Position] on MSN than on Google.
At work here is the fact that Google knows a lot about me (“learning” my searches, via the Google Toolbar), and that makes it hard to get an “honest answer” about SERPs unless I use a clean machine .. but that won’t be entirely honest either, because it will correspond to a user about whom Google knows nothing.
MSN (LiveSearch) is generally less successful than Google in my searches (that is, on the topics in which I have an interest), but it is not clear whether LiveSearch is “not as good” an engine as Google, or whether Google “knows me better”.
Clusty - a clustered search engine, very useful imho - finds this site at position 74 out of some 87,000 on a search for “keyphrase extraction”: this may be some sort of reliable indicator, because although Clusty has feeds from Google as well as other sources, I would be surprised if it had my search history from Google, or had accumulated its own - there is no Clusty “toolbar” per se, and I don’t use it frequently.
If I cared about these results, I might also take heart from the fact that Clusty listed me in the first (presumably “most important”) cluster.. but then again, I might be fooling myself.
If I care to continue indulging myself, I might search for “Data Sciences” (including the quotes) with Clusty, and find that we are at positions 2 and 8 in the “Analysis” cluster (the second one).
Good news, no?
Well, maybe.
If I take the quotes away .. well, the picture is much more complex. We are there, but the SERPs are not nearly so encouraging.
What a difference a couple of quote marks make.
And what huge potential for biasing any experiment or evaluation.
Continuing with Clusty, which is at least arguably a “fair” engine (at least somewhat ignorant of what I want to hear) .. If I search for Data Sciences Analytics (no quotes) I get positions 1, 2, 5 and 10 on the front page!!! Great.
And just to prove it, I go to MSN with the same query, Data Sciences Analytics, and get position 1 out of 2,690,000!!! But no further results until the second page .. aw, shucks.
And to really really prove it I went to a different machine and queried MSN from there with the same query and got the same result!! OK, it was on the same network and shared an IP address, but well, it’s still a good result, no?
So, if we are going to research SERPs, to get an unbiased estimate of our “TRUE SERP”, what should we do?
Well, if this is not “mission impossible” it sure is difficult.
Firstly, recognize that a SERP is some function of the query, the user, the search engine, the browser and the browsing history, and the machine. These environmental factors come into it because the search engines (sometimes) record your behaviour and interests.
And some sort of averaging across a set of experiments varying these factors would be a good idea. It might also be advisable to realize that the SERP can take on a huge range, and to work with some robust averaging mechanism (the median perhaps, or the proportion of times the query gave a result in the top 100 or ..). And while SERPs are ranks, it might be a good idea to convert them to some sort of metric by weighting positions within a page and across pages with some sort of exponentially declining weight.
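Here is a minimal sketch of what such a summary might look like in Python. Everything in it is an assumption made for illustration: the function names, the choice of 10 results per page, and the decay rates 0.85 (per slot within a page) and 0.5 (per page) are all invented, not measured from any search engine.

```python
import math
from statistics import median

def summarize_positions(positions, top_n=100):
    """Summarize result positions from repeated runs of one query.

    positions: 1-based positions, with None where the site was not found.
    """
    if not positions:
        raise ValueError("need at least one observation")
    # Treat "not found" as infinitely far down, so it still counts
    # against the median instead of being silently dropped.
    ranks = [math.inf if p is None else p for p in positions]
    return {
        "median": median(ranks),
        "top_n_rate": sum(r <= top_n for r in ranks) / len(ranks),
    }

def position_weight(position, per_page=10, slot_decay=0.85, page_decay=0.5):
    """Exponentially declining weight: 1.0 for the top result, falling
    by slot_decay per slot within a page and page_decay per page."""
    page, slot = divmod(position - 1, per_page)
    return (page_decay ** page) * (slot_decay ** slot)
```

With made-up observations such as `[1, 14, 61, None, 453]`, the median is 61 and the top-100 rate is 0.6. The single bad run at 453 and the "not found" run barely move the median, which is the point of using a robust summary.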
The choosing of the machines/users over which to conduct this experiment is also important - you do not want tabulae rasa - you really want those who might be interested in what you are selling to conduct the experiment on their “educated” machines.
All a bit difficult.
But the really hard bit in keeping this experiment honest is in choosing the queries.
Choose some “key phrases” from your website?
Well, depending upon the length of the phrase and the rarity of the keywords, you are going to get a good result, aren’t you?
Not necessarily an honest one.
But you could start thinking about this as an upper bound of sorts… if you get position 453 out of 10000 for your highly specific query, well that’s not so great.
You could start thinking about adjusting for the individual word probabilities (given an interested user) - that might normalize things somewhat.
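One possible reading of that adjustment, and it is only a sketch, is to score how "specific" a query is by its surprisal: the sum of minus log-probabilities of its words. This assumes the words occur independently, which is certainly not true of real queries, and the `word_prob` table is hypothetical; where such probabilities would come from for an "interested user" is exactly the open question.

```python
import math

def query_surprisal_bits(words, word_prob):
    """Surprisal of a query in bits, assuming its words occur independently.

    word_prob: hypothetical mapping from word to the probability that an
    interested user types it. Unknown words raise KeyError.
    """
    return -sum(math.log2(word_prob[w]) for w in words)
```

A query with high surprisal is a rare one, so a good position for it says less than the same position for a common query; reporting the two numbers side by side is one way to keep the comparison honest.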
And maybe you could get unbiased people to generate some “semantic equivalents”.
But there is a lot more hard thinking needed in this area, imho.
Sandro Saitta said,
February 24, 2008 @ 11:45 pm
“since I cannot control or influence that, I won’t waste any more time on the PR “concept””
Well, in fact you can influence the PR. It is not as easy as influencing the Technorati or Alexa ranks, but it is feasible. If there are more websites (with great PR) linking to your site (and not too many others), then your PR will increase (of course, it will take some time).
Regarding SERP (P is for “Page”), it is not influenced by your search history. Normally, if you search on different computers, with different cache, you should obtain exactly the same SERP (maybe I misunderstood you).
John Aitchison said,
February 25, 2008 @ 10:40 am
Sandro, thanks for the correction re the definition of SERP.
I was really talking about the POSITION on the results page .. I wonder if there is a term for that?
“Normally, if you search on different computers, with different cache, you should obtain exactly the same SERP (maybe I misunderstood you).” ..
you may be right.. perhaps it is the influence of the cache
If I search for “keyphrase extraction” (with quotes) with Google I get my article (“the Start of the Art in Keyphrase Extraction”) at position 14. This seems high, given the size of the field and the fact that MSN/Live Search finds it at position 61.
Can you, as a favor, run this search for me from your machine and see if you get the same results? I would appreciate it.
Thanks for the comment
Sandro Saitta said,
February 25, 2008 @ 9:43 pm
Since it does not depend on the cache of a browser, I obtain it at the same position (i.e. 14). A very good tool is the SEOmoz Rank Checker:
http://www.seomoz.org/rank-checker
You simply put your website url, the keyword you’re interested in and the search engine. It will then inform you on the position on the SERP.
John Aitchison said,
February 26, 2008 @ 6:51 pm
A useful tool, thanks Sandro: it seems that Google does not use any form of “personalized search”, that is, taking into account your search history. I suppose when you think about it, that would be a very hard feature to implement (a sort of recommender system), even for Google.
Sandro Saitta said,
March 11, 2008 @ 1:53 am
Here is an interesting article regarding the kind of data Google uses to improve its SERPs:
http://www.seomoz.org/blog/proof-google-is-using-behavioral-data-in-rankings
John Aitchison said,
March 13, 2008 @ 9:30 am
Thank you Sandro .. that link leads to lots of interesting stuff.
Firstly, it appears that Google DOES do “personalized search” IF you have the toolbar installed, and IF you are logged in.
For example, see here http://googlesystem.blogspot.com/2007/04/how-to-disable-google-personalized.html
and
http://www.rimmkaufman.com/rkgblog/2007/06/10/personalized-search/
I have not been able to detect any difference in SERPs so far, but I have not done serious testing.
Google “web history” is explained here
http://googlesystem.blogspot.com/2007/04/google-web-history.html
It looks like you have to specifically enable it, and be logged in for it to take effect. The above link shows how.
OT, the “disabling PWS” article above lists PWS=0 as the appropriate parameter .. here is a list of other URL parameters
http://www.joostdevalk.nl/wp-content/uploads/2007/07/google-url-parameters.pdf
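For anyone wanting to script the comparison, a small sketch of building the search URL with and without that parameter follows. The `search_url` helper is invented here, and whether Google honours `pws=0` in any given case is an assumption taken from the article above, not something I have verified.

```python
from urllib.parse import urlencode

def search_url(query, personalized=False):
    """Build a Google search URL; pws=0 asks for non-personalized results."""
    params = {"q": query}
    if not personalized:
        params["pws"] = "0"
    return "http://www.google.com/search?" + urlencode(params)
```

Calling `search_url('"keyphrase extraction"')` gives `http://www.google.com/search?q=%22keyphrase+extraction%22&pws=0`, which can be pasted into a browser alongside the plain version to compare positions.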
The link you sent, “Proof Google is Using Behavioral Data in Rankings”, is long and worth reading and does report an experiment by Visio, which may or may not be conclusive. The article really goes to the issue of what else, apart from content, Google is using to calculate ranks/deliver SERPs: it is not specifically pertinent to the issue of whether or not Google delivers different SERPs to different browsers/searchers.
There is a consensus, of sorts, on “Important Search Engine Ranking Factors” here http://www.seomoz.org/article/search-ranking-factors
and an interesting article
http://www.seomoz.org/blog/throwing-out-the-book-on-seo
“Throwing Out the Book on SEO!”
which argues that “visitor actions” (i.e. on-site behaviours) are the coming thing.