Evolving Datasets and the Netflix Prize

I have decided to throw my hat in the ring for the million dollar Netflix Prize. Well, tentatively so.

That is, until I can form an opinion as to whether there is a reasonable prospect of reaching the target – the accuracy bar that Netflix has set.

There may indeed not be any such reasonable prospect. Netflix does not know if it is achievable, the contestants don’t know, and the recent slowdown in the gains achieved (check the leader board) suggests that we may be approaching the limits of predictability (from the data as it stands – it would be a different story, perhaps, if the data was enriched).

Be that as it may, I have downloaded the data, looked at it, formed some tentative opinions and have some ideas about analytic approaches. (Yeah, looked at it, looked at the raw records themselves. Statisticians often talk about starting with data visualization; I like to start a step back and look at the records themselves – give me respondents’ raw questionnaires any day.)

Now before you web analytics types shrug and say “collaborative filtering has nothing to do with me, nor does the problem of predicting preferences based on past revealed preferences (as indicators of ‘taste’)”, just hang in there a bit, OK?

Because the biggest single problem I see with this dataset is that it is evolving.

And your dataset, your business, might be evolving too. And your modelling might have to take that into account, seriously take that into account, too.

Most statisticians and data miners have to deal with a static dataset – there it is, that is the data we have, go boy, go dig.

And some of the contestants in the Netflix Prize seem to have taken just that viewpoint: that they have a dataset that represents a single slice of time, and they will analyze that as a monolith and ignore the date information on every record.

That may be within the rules, but does not represent the reality of Netflix’s business, borrowing as it does from the future. In reality, we have only an ever expanding past or a sliding window of that past and we have little or no way of knowing how “stationary” or “currently representative, in the eternal now” that past is.
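To make the "borrowing from the future" point concrete, here is a minimal sketch in Python (standard library only) of the difference between a split that ignores dates and one that respects them. The record layout, the toy ratings and the function names are all assumptions made up for illustration; nothing here is supplied by Netflix.

```python
import random
from datetime import date

# Hypothetical toy records: (customer_id, movie_id, rating, rating_date)
ratings = [
    (1, 10, 4, date(2004, 1, 5)),
    (1, 11, 3, date(2005, 6, 1)),
    (2, 10, 5, date(2004, 3, 9)),
    (2, 12, 2, date(2005, 11, 20)),
    (3, 11, 4, date(2003, 7, 14)),
    (3, 12, 5, date(2005, 12, 1)),
]

def random_split(records, test_fraction=0.5, seed=0):
    """Ignore dates: training data may include ratings made after test ratings."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def temporal_split(records, cutoff):
    """Respect dates: train only on ratings made strictly before the cutoff."""
    train = [r for r in records if r[3] < cutoff]
    test = [r for r in records if r[3] >= cutoff]
    return train, test

def leaks_from_future(train, test):
    """True if any training rating is dated after the earliest test rating."""
    if not train or not test:
        return False
    return max(r[3] for r in train) > min(r[3] for r in test)

train, test = temporal_split(ratings, cutoff=date(2005, 1, 1))
print(len(train), len(test), leaks_from_future(train, test))
```

A temporal split can never leak by construction, whereas a random split usually will on real data; calling leaks_from_future on the output of random_split is a cheap way to check.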

But the incompatibility of the various slices of the past is dramatically highlighted in Netflix’s case.

Where to start?

Well, very obviously, new DVDs are being released on to the market all the time and perhaps some old ones are being deleted from stock.

The consideration set is constantly changing, at least in its detail – we may be able to put some static structural overlay by invoking the concept of genres (if we can find them or agree on them) but even those won’t be static (although they will hopefully be more slowly evolving) and it is by no means clear that genres will help us in prediction.

Nice to know, perhaps, and it is nice to feel that one “understands” the patterns in the data, but the jury is still out (in my mind, for this application) as to whether we will get real predictive gains from that approach.

And Netflix’s business is itself changing. They are getting bigger, more successful, and have more titles in stock. That may have the effect of diluting correlations, of making the patterns we observe in the past decay in their relevance. Handling this effect is non trivial.

I have concerns too about the underlying spikiness of the data. A new DVD for a recent hit movie is released and is heavily promoted .. bam, the DVD rentals go through the roof – unpredictably so, from the point of view of this dataset which does not include that data.

A big dataset. Big == Good?

We have around 100 million ratings in this sample, on a 5 point integer “scale” (which, btw, it is possible that Netflix has fiddled with because apparently half point ratings used to be allowed), nearly 500,000 customers and nearly 20,000 movies. A fair bit of data, I guess.

Specifically

  • 480,189 User IDs
  • 17,770 Movies
  • 100,480,507 Ratings collected from October 1998 to December 2005

For a good look at some of its characteristics, see “Dissecting the Netflix Dataset”.

There are indeed a lot of problems with the dataset. I am not easy in my mind about the sampling procedure employed, nor about the impoverished nature of the data (no demographics on customers, no information about the movie content, actors or directors, and an effective prohibition on getting this information from other sources).

I am not even sure that the problem as phrased and the dataset as provided actually map closely to what Netflix wants/needs. At the end of the day they probably want a recommender system .. but they are asking us to predict how much a person who has already chosen X will like X. A disconnect here.

But, datasets like this happen in real life, and it is of interest to see how far we can get.

Some have suggested that this is a PR stunt by Netflix, others that it is a cheap way of getting good minds to work on it ($1,000,000 won’t buy you much of a hotshot research team), and still others that it is a new way of encouraging distributed collaborative research.

I won’t get into that except to note that competitions are not new (think architectural competitions), nor necessarily fair to the participants, nor indeed necessarily a good way of getting the real problem solved.

And perhaps to point out that if you are asking a whole bunch of people who are AI and optimization experts to compete, then you had better expect that the system will be gamed. Have a look at the learning forum to see how some of the contestants are thinking.

Well, what next?

Next is not about analysis.

Next is about an efficient data representation that is going to be nice to me when I ask it to do hard things, and hard things lots and lots of times.

So, I am not going to be talking about building a MySQL database from this, although it is feasible (some contestants have done this). I am going to build a “virtual dataset”.

This virtual dataset pretty much ignores the way the data is currently structured (nearly 18,000 files, one per movie, with customers, dates and ratings within each) and aims to treat it as one huge (multi gigabyte) matrix of movies x customers x dates – from which we can extract just about anything, including summaries along any dimension, customer history sequences, subsets of the data for pushing to an analysis routine, etc.
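One way such a "virtual dataset" might look is a flat, columnar store with one entry per rating, sketched below in Python using only the standard library. This is an illustration under stated assumptions, not the actual framework: the class name RatingStore, the EPOCH origin, and the assumed file layout (a "movie_id:" header line followed by "customer,rating,YYYY-MM-DD" rows) are all hypothetical, and the sample rows are made up.

```python
from array import array
from collections import defaultdict
from datetime import date

EPOCH = date(1998, 1, 1)  # arbitrary origin (an assumption) for compact day offsets

class RatingStore:
    """Flat columnar store: four parallel arrays, one entry per rating."""

    def __init__(self):
        self.movie = array("H")     # unsigned short: movie IDs
        self.customer = array("L")  # unsigned long: customer IDs
        self.rating = array("B")    # unsigned char: ratings 1..5
        self.day = array("H")       # unsigned short: days since EPOCH

    def load_movie_text(self, text):
        """Parse one per-movie file in the assumed layout described above."""
        lines = text.strip().splitlines()
        movie_id = int(lines[0].rstrip(":"))
        for line in lines[1:]:
            cust, rating, ymd = line.split(",")
            year, month, day = map(int, ymd.split("-"))
            self.movie.append(movie_id)
            self.customer.append(int(cust))
            self.rating.append(int(rating))
            self.day.append((date(year, month, day) - EPOCH).days)

    def mean_rating_by_movie(self):
        """Summary along the movie dimension."""
        totals = defaultdict(lambda: [0, 0])  # movie -> [sum, count]
        for m, r in zip(self.movie, self.rating):
            totals[m][0] += r
            totals[m][1] += 1
        return {m: s / n for m, (s, n) in totals.items()}

    def customer_history(self, customer_id):
        """One customer's ratings as (day, movie, rating), in date order."""
        rows = [
            (self.day[i], self.movie[i], self.rating[i])
            for i, c in enumerate(self.customer)
            if c == customer_id
        ]
        return sorted(rows)

# Made-up sample rows in the assumed layout
store = RatingStore()
store.load_movie_text("1:\n101,3,2005-09-06\n102,5,2005-05-13\n")
store.load_movie_text("2:\n101,4,2005-10-01\n")
print(store.mean_rating_by_movie())
print(store.customer_history(101))
```

The linear scan in customer_history is only for brevity; at 100 million ratings one would want the rows sorted or indexed by customer so that a history is a contiguous slice.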

So, we are not yet to the data mining/statistical analysis stage and we are not going to be doing any heavy duty analysis until I build yet another framework – one to record the details of all of the analyses, their intent and their results.

Mining one’s own processes requires tools too.

After that, I will implement some baseline approaches like SlopeOne, which (from the bare bones information supplied) is what I suspect Netflix is doing currently, and then look at the errors and post-process those. I do have some ideas for algorithms, but I am not going to get into that implementation phase until I am convinced that there is some extra predictability in the system.
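For readers who have not met it, here is a minimal sketch of the weighted Slope One scheme in Python. It is a toy illustration of the baseline idea only, not a claim about what Netflix actually runs; the function names and the three made-up users are assumptions for the example. The idea is to learn the average rating difference between each pair of items, then predict a missing rating by adding those differences to the user's known ratings, weighted by how many users support each pair.

```python
from collections import defaultdict

def train_slope_one(user_ratings):
    """Return (dev, count): dev[j][i] is the mean of (r_j - r_i) over the
    users who rated both items; count[j][i] is the number of such users."""
    diff_sum = defaultdict(lambda: defaultdict(float))
    count = defaultdict(lambda: defaultdict(int))
    for ratings in user_ratings.values():
        for j, r_j in ratings.items():
            for i, r_i in ratings.items():
                if i != j:
                    diff_sum[j][i] += r_j - r_i
                    count[j][i] += 1
    dev = {
        j: {i: diff_sum[j][i] / count[j][i] for i in diff_sum[j]}
        for j in diff_sum
    }
    return dev, count

def predict(dev, count, ratings, target):
    """Weighted Slope One prediction of `target` for a user with `ratings`.
    Returns None when no co-rated item pair supports a prediction."""
    numerator = 0.0
    denominator = 0
    for i, r_i in ratings.items():
        if i == target:
            continue
        n = count.get(target, {}).get(i, 0)
        if n:
            numerator += (dev[target][i] + r_i) * n
            denominator += n
    return numerator / denominator if denominator else None

# Made-up toy data: three users, three items
users = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}
dev, count = train_slope_one(users)
print(predict(dev, count, users["lucy"], "A"))
```

For Lucy and item A, the pair (A, B) has a mean difference of 0.5 from two users and the pair (A, C) has a mean difference of 3 from one user, so the prediction is ((0.5 + 2) × 2 + (3 + 5) × 1) / 3 = 13/3, about 4.33.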

Well, it should be fun.

(I’ll let you know if I win the $1m, maybe.)

5 Comments »

  1. Shane said,

    March 20, 2007 @ 11:25 am

    Isn’t it great that a company is prepared to put up $1M to see a problem like this solved!! Who cares what their motivation is!

    Netflix’s Cinematch is a variant of Pearson’s correlation, according to the paper linked on the KDD Cup 2007 website. So far the most popular algorithm on the forum seems to be Singular Value Decomposition, which gains about 5% on Cinematch, but others seem to have tried SlopeOne also.

    Great to hear you are going to tackle Netflix, good luck!

  2. John Aitchison said,

    March 20, 2007 @ 7:58 pm

    Thanks, Shane, for the post and good wishes .. I see you are in Melbourne, my old stamping ground and, it seems, a bit of a hot spot for data analytics.

    I have added you to my blogroll, dk why I did not find you before. No need to reciprocate - I see you have your blogsite set up a bit differently.

    I am going to follow up on your post about Australian geocoding via Google- the ABS has their plans for this, but maybe they have been pre-empted. Anyways, I am always interested in geodemographic data .. see, eg http://dsanalytics.com/dsblog/australian-census-data-is-finally-again-free_62

    As for the Netflix data, it does not immediately present to my mind as something where linear correlations are entirely appropriate. There is a lot of restriction of range (most ratings are 3, 4, or 5), as is to be expected (these people did, after all, expect to like the movie when they borrowed the DVD), and the scale is at best ordinal. I think there are some limen issues here too .. what is the tipping point that tips a 3 into a 4?

    So, it should be interesting.

    Thanks for the comment

  3. Shane said,

    March 22, 2007 @ 10:58 am

    Thanks for linking me, I will have to start blogging again!

    How awesome that the ABS data is now free! Hopefully we will see something along the lines of what the Juice guys have done:
    http://www.juiceanalytics.com/weblog/?p=119
    http://www.juiceanalytics.com/weblog/?p=202
    http://www.juiceanalytics.com/weblog/?p=144
    http://www.juiceanalytics.com/weblog/?page_id=99

  4. » Taming the Google Monster–we have PageRank 5!. Cool, I guess. [ Data Sciences Analytics ] said,

    May 24, 2007 @ 11:05 am

    […] Who knows?. These evolving antagonist problems are intrinsically very interesting, though. And I guess the web dataset is the biggest game in town, bigger than Netflix – sorta hard to resist. I’d quite like to build my own semi-supervised crawler/spider, but perhaps not tonight Josephine. […]

  5. John Aitchison said,

    February 20, 2008 @ 6:45 pm

    small update .. yes, I am still working on it in my spare time

    Whimsley has an article at http://whimsley.typepad.com/whimsley/2007/07/the-limitations.html
    in which he explores the notion that improvements in the RMSE may not translate to (noticeable) improvements in the customer experience (he also has a good summary of the problems with the data)

    and Yehuda Koren (KorBell .. the current top team) argues somewhat to the contrary .. that is, that RMSE does matter, but he also takes a ranking approach and his algorithm is influenced by the idea that recommender systems should predict the “top K”, which implicitly means placing more emphasis on predicting the “5’s” .. See http://www.netflixprize.com/community/viewtopic.php?id=828
