Archive for April, 2007

Cookie Death

Yet another piece of research that points up the difficulty of getting hard numbers on the web.

The problem, of course, is that users can, and do, delete cookies at will - supposedly about 30% of users delete cookies once a month. See “Cookies will get you fat” at Immeria for a nice summary of the “research” and what to do about it:

  • Avoid providing hard numbers; trends and context are essential.

  • Use the “visit” as the metric of choice, think in terms of “opportunity”.

  • The only instance where you should report on “unique users” is when you can use a real unique identifier such as a login.

I concur with the last of these. The others perhaps sidestep the real issue: the massive underlying unreliability of the data, particularly where repeat visitation is relied upon.

For the original report, look at the ComScore press release, and at some earlier reports: “Cookie Death Small Potatoes” and “Don’t Grieve for Cookies”.

There is even an innovative suggestion “Tacoda Tech Replaces Deleted Cookies” which seems to involve taking a hash (unique summation) of first party cookies to form a unique identifier .. pity that won’t work, because it appears that first and third party cookies are deleted about equally often.
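To see why deletion undermines this kind of identifier, here is a minimal sketch. It is not Tacoda's actual method (the post only guesses at that); it simply assumes the identifier is a SHA-256 hash over the visitor's first party cookies, and the cookie names and values are made up for illustration.

```python
import hashlib

def cookie_fingerprint(cookies):
    """Hash the name=value pairs of a cookie jar into one identifier."""
    # Sort so the same set of cookies always hashes the same, whatever the order.
    canonical = "&".join(f"{name}={value}" for name, value in sorted(cookies.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical first party cookies for one visitor.
before = {"session": "abc123", "prefs": "dark", "visitor": "42"}
# The same visitor after a cookie cleaner has removed two of them.
after_cleanup = {"prefs": "dark"}

same_visitor_recognised = cookie_fingerprint(before) == cookie_fingerprint(after_cleanup)
print(same_visitor_recognised)
```

The fingerprint is stable only while the underlying cookies survive; once they are cleaned out, the same person hashes to a different identifier.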

Not surprising .. many people use a smart cookie cleaner .. for example, I use CrapCleaner frequently (but on no fixed schedule).

No such thing as a unique user - maybe a unique combination

And as Immeria most sensibly points out

The study from ComScore doesn’t cover the fact that more and more users are using multiple devices (home & company computer, PDA and other devices) and even multiple browsers (sometimes switching between Firefox and MSIE, both installed on the same computer). So the notion of “unique user” should be retitled to “unique device + browser”

Right. I personally use 2 desktop machines with three browsers on each, plus a laptop, plus ..

So, what does a statistician make of all this?

First of all, be wary, be very wary. This is not good data. We have data that is affected by either “missing at random”ness or “censoring” or both.

And it may not be “just” a matter of bias (inflated figures); it may also be a matter of inflated variance. And that affects your significance tests: whether you can conclude that one campaign is better than another, whether there are real trends.

To work this through properly, we would have to postulate a reasonable DGP (data generating process) and run some simulations.

But here is the intuition.

Working with Visits

Suppose that we are running a campaign at time T(1) and have collected visitor/visit data at times T(0) (before the campaign) and T(2) (after the campaign).

At each of these times we have an accurate measure of the number of visits (H), aka hits, from log files. Now we know that the number of visits is a stochastic quantity: it has a random component due to .. just about anything, the weather, competitive activity, whatever …

And with a long enough prior history we might have a good estimate of the month-to-month variability of H, its variance.

So we could, in theory, run a significance test on the change in H from T(0) to T(2) and find the weight of evidence for that change (from H(0) to H(2)) being attributable to the campaign rather than just random. OK, it’s a bit more complicated than this, but that is the general idea: what is the probability that the observed change was due to our action, compared to it being just a natural outcome of a system that moves up and down and around?
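A rough sketch of such a test follows. It assumes (a big simplification) that monthly visit counts are independent and roughly normal around a stable level, and every number in it is hypothetical.

```python
from statistics import NormalDist, stdev

def visit_change_z(history, before, after):
    """z-score and two-sided p-value for the change in visits H.

    Assumes independent, roughly normal month-to-month noise whose
    spread is estimated from the prior history.
    """
    sd = stdev(history)          # month-to-month variability of H
    se = sd * 2 ** 0.5           # standard error of a difference of two months
    z = (after - before) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical monthly visit counts before the campaign.
history = [10000, 10400, 9800, 10100, 9900, 10300]
z, p = visit_change_z(history, before=10000, after=10900)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

A small p-value says only that the change is large relative to the usual month-to-month wobble; it does not by itself rule out seasonality or other causes.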

Working With Visitors

But what if we were not satisfied with hit counting? There might be good logical reasons for this .. the idea of the campaign might be to get more “users” not more “uses”.. and if so, it is not much use if all the campaign did was to induce the existing pool of visitors to visit more often.

We want to expand V (the population of visitors) .. note that there is terminological sloppiness here: we really DO want to expand the population of visitors, but at best (even if there were no cookie death) we would be observing only an estimate of V .. think mark-recapture sampling, where you catch some salmon, tag them and throw them back, then catch some more and see how many have been tagged.

I will continue to talk about V, visitors/month as if it was the real V, for convenience, but the real V is all the salmon, not just those we see.
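For readers who have not met mark-recapture before, the simplest version is the Lincoln-Petersen estimator, sketched below with made-up numbers. It assumes a closed population and that no fish loses its tag between the two catches, which is exactly the assumption cookie deletion breaks.

```python
def lincoln_petersen(tagged_first, caught_second, recaptured):
    """Estimate total population size from a two-pass mark-recapture sample.

    tagged_first  : fish tagged and released in the first catch
    caught_second : fish caught in the second catch
    recaptured    : fish in the second catch that already carry a tag
    """
    if recaptured <= 0:
        raise ValueError("need at least one recaptured fish to form an estimate")
    # The tagged share of the second catch is taken to match the tagged share of the river.
    return tagged_first * caught_second / recaptured

# Hypothetical numbers: tag 200, later catch 150, of which 30 carry tags.
print(lincoln_petersen(200, 150, 30))   # 1000.0
```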

Here is where it gets interesting. We want to estimate V from H. Now, some of the salmon did not lose their tags, so we have an error-free (perhaps) count of those. The remaining hits were caused partly by previously tagged salmon that had lost their tags or deliberately removed them (the smarter ones?) and partly by never-tagged salmon. From those left-over hits we need to estimate how many salmon caused them.

And here is where an extra stochastic parameter enters in. The hits per visitor (HPV) is a random variable, which we know only imperfectly, but which we have to use in order to convert our “left over hits” into visitors. Unfortunately HPV is likely to have a long tail, too.

And there is another unknown - the proportion of salmon who have lost their tags (killed their cookies). We can estimate it (perhaps by using external panel surveys), but it too is a random variate subject to uncertainty.

So, the calculation of the number of unique salmon (um, visitors) from those observed (hits) is one that involves uncertain parameters - hence the variance is increased, and simple-minded significance tests are most probably wrong.
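Here is a toy simulation of that intuition, not a worked-out DGP. Everything in it is an assumption made for illustration: the true population is fixed at 1,000 visitors, 30% delete their cookies, hits per visitor follow a skewed distribution, and all left-over hits are converted to visitors by dividing by an HPV figure that is either known exactly or known only with some error.

```python
import math
import random
from statistics import pvariance

DECAY = 0.5                                  # hypothetical tail parameter for hits
TRUE_HPV = 1 + 1 / (math.exp(DECAY) - 1)     # mean of 1 + floor(Exponential(DECAY))

def estimate_visitors(rng, true_visitors=1000, delete_rate=0.3, hpv_noise=0.0):
    """Simulate one month and return the estimated number of visitors."""
    tagged = 0           # visitors who kept their cookie: counted exactly
    leftover_hits = 0    # hits from visitors we can no longer identify
    for _ in range(true_visitors):
        hits = 1 + int(rng.expovariate(DECAY))   # skewed hits per visitor
        if rng.random() < delete_rate:
            leftover_hits += hits
        else:
            tagged += 1
    # Our working HPV is itself an estimate; hpv_noise sets how uncertain it is.
    hpv_estimate = TRUE_HPV * math.exp(rng.gauss(0, hpv_noise))
    return tagged + leftover_hits / hpv_estimate

def estimator_variance(hpv_noise, reps=200, seed=1):
    """Variance of the visitor estimate across many simulated months."""
    rng = random.Random(seed)
    return pvariance([estimate_visitors(rng, hpv_noise=hpv_noise) for _ in range(reps)])

var_known = estimator_variance(hpv_noise=0.0)
var_uncertain = estimator_variance(hpv_noise=0.2)
print(f"variance with HPV known exactly: {var_known:.0f}")
print(f"variance with HPV uncertain:     {var_uncertain:.0f}")
```

The true number of visitors never changes between runs, so any extra spread in the second figure comes purely from not knowing the conversion parameter. A test that ignores this will be too confident.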

OK, enough fishy stories


Mining Terabytes on the Desktop

I have always had a soft spot for the book “Managing Gigabytes” by Witten, Moffat and Bell. If memory serves me they did indeed foresee the day when we would be working with terabytes (thousands of gigabytes), more or less routinely.

Although the book is primarily about compressing and indexing documents and data, it also goes into some fascinating areas such as perfect hashing, and points up the connection between compression, modelling and the minimum description length (MDL) principle: the idea that the best theory for a set of data is the one that minimizes the size of the theory plus the amount of data necessary to specify the data relative to the theory (the residuals, if you like). MDL is a lovely idea, “small is beautiful”, but then in most algorithms we also have a time-space tradeoff.

All of this comes home to me as I work on the Netflix dataset, where I have about 100 million records which need to be intensively mined, and I start to wonder about the limits of what can be done on a single desktop or a small LAN.

High Performance by fitting the data storage/retrieval to the algorithm

In applications such as this there is usually a need to fit the data structures and data storage to the algorithm. One of my algorithms (not unrelated to Singular Value Decomposition) calls for sequential retrieval of the data, and I can achieve a single pass through the data in about 20 seconds, or about 5 million records per second. This is NOT keeping all the data in core, but making sure that the underlying operating system and hard drive caching work with me and not against me.
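To give the flavour of a storage layout fitted to a sequential algorithm, here is a minimal sketch. It is not my actual code: the 7-byte record layout (user id, movie id, rating) and the function names are assumptions made for illustration. The point is that fixed-width records read front to back in large chunks let the operating system's read-ahead do the work.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width record: user id (4 bytes), movie id (2), rating (1).
RECORD = struct.Struct("<IHB")

def write_records(path, records):
    """Write (user, movie, rating) tuples back to back with no separators."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))

def sequential_rating_sum(path, records_per_chunk=65536):
    """One front-to-back pass over the file, reading large contiguous chunks."""
    total = count = 0
    chunk_bytes = RECORD.size * records_per_chunk   # always a whole number of records
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            for _user, _movie, rating in RECORD.iter_unpack(chunk):
                total += rating
                count += 1
    return total, count

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ratings.bin")
    write_records(path, [(1, 10, 5), (2, 10, 3), (1, 11, 4)])
    print(sequential_rating_sum(path))   # (12, 3)
```

Pure Python will not get near 5 million records per second with a loop like this; the sketch shows the access pattern, not the speed.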

Locality of Reference, Feasibility, and Scalability

So-called “locality of reference” is what this is all about .. aka, don’t ask the read heads to move about too much. In another algorithm, I need some random starting points, but can proceed sequentially from there. Again, this works OK, but only by being very careful with the file setup. Sometimes duplication of files (in different physical orders) is better than the single hard disk.

This sort of performance can never be achieved through SQL or any mainstream database, where the data of interest (the logical sets) are not physically contiguous on the hard drive.

So, who cares? Well, the issue is one of feasibility. Is it possible to use state of the art algorithms on this problem, or do I have to change the algorithms to suit my hardware resources? Scalability is also an issue .. sure, I could put another PC on the network, but that only doubles my grunt power; it does not change it by an order of magnitude. And some algorithms are not easily parallelizable anyway.

You can get some biggish numbers in Churn analyses too.

A colleague recently told me that he was working with more than 50 million records per day. No prizes for guessing the industry, but you get these sorts of numbers in any environment where you have electronic monitoring or natural phenomena (eg high energy physics, video monitoring, weather .. see ‘The Terabyte Challenge’).

The Center for Customer Relationship Management at Duke University had a “Churn Response Modelling Tournament” in 2003. Only about 100,000 records and 171 variables on each (so, dense, not sparse) – you can see the specs here.

In brief:

The predictors include three types of variables: behavioral data such as minutes of use, revenue, handset equipment; company interaction data such as customer calls into the customer service center, and customer household demographics.

Customers were selected as follows: mature customers, customers who were with the company for at least six months, were sampled during July, September, November, and December of 2001.

For each customer, predictor variables were calculated based on the previous four months.

Churn was then calculated based on whether the customer left the company during the period 31-60 days after the customer was originally sampled.

The actual percentage of customers who churn in a given month is approximately 1.8%.

However, churners were over sampled when creating the Calibration sample to create a roughly 50-50 split between churners and non-churners (the exact number is 49,562 churners and 50,438 non-churners).

Over sampling was not undertaken in creating the Current Score and Future Score validation samples. This is to provide a more realistic predictive test. The Current Score data contain a different set of customers from the Calibration data, but selected at the same point in time. The Future Score data contain a different set of customers selected at a future point in time.
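One practical consequence of that design: a model fitted to the roughly 50-50 Calibration sample will produce churn probabilities that are far too high for a population where only about 1.8% churn. A standard adjustment (not something the tournament specs prescribe, just the usual prior-shift correction) rescales the odds by the ratio of population to sample rates. The function name is my own.

```python
def correct_for_oversampling(p_sample, sample_rate, population_rate):
    """Map a probability fitted on an oversampled set back to the population scale."""
    # Reweight churners and non-churners by how much each was over- or under-sampled.
    churn_weight = population_rate / sample_rate
    stay_weight = (1 - population_rate) / (1 - sample_rate)
    numerator = p_sample * churn_weight
    return numerator / (numerator + (1 - p_sample) * stay_weight)

SAMPLE_RATE = 49562 / 100000    # churner share of the Calibration sample
POPULATION_RATE = 0.018         # approximate monthly churn rate quoted above

# A customer scored at 0.80 on the calibration scale:
print(round(correct_for_oversampling(0.80, SAMPLE_RATE, POPULATION_RATE), 4))
```

The ranking of customers is unchanged by the correction; only the scale of the probabilities moves.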

It is not always a good idea to work with the whole dataset

Apart from the feasibility issues, the data set you have been given may not be the best one to work with.

You may get more insights into model performance and problem areas by stratifying the data and fitting separate models to each segment than by feeding the whole humungous dataset into the maw of the black box.

Stratification is, by the way, not the same as oversampling (eg oversampling of churners as in the above example) but it can be used to construct datasets that most closely match your analytic objectives - you could, for example, create a stratum of persons or calls based on the difference between their actual calls and the predicted calls - ie stratifying by residuals.
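A minimal sketch of stratifying by residuals, assuming each record is a hypothetical (customer id, actual calls, predicted calls) tuple and that a band of ±10 calls counts as “close enough”; both the band and the stratum names are arbitrary choices for illustration.

```python
def stratify_by_residual(records, band=10.0):
    """Split records into strata by the size and sign of actual minus predicted."""
    strata = {"below_prediction": [], "near_prediction": [], "above_prediction": []}
    for customer_id, actual, predicted in records:
        residual = actual - predicted
        if residual < -band:
            strata["below_prediction"].append(customer_id)
        elif residual > band:
            strata["above_prediction"].append(customer_id)
        else:
            strata["near_prediction"].append(customer_id)
    return strata

# Hypothetical customers: (id, actual calls, predicted calls).
records = [("a", 120, 100), ("b", 95, 100), ("c", 40, 100), ("d", 105, 100)]
print(stratify_by_residual(records))
```

Each stratum can then be examined, or modelled, on its own; the badly predicted ones are usually the most instructive.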

Terabyte sized hard disks are here now

Seagate has a ¾ terabyte (750 GB) SATA hard disk available in Australia for about $500.

But that does not mean that we can readily work with terabyte sized files. The file size limits under Windows are:

  • 2 GB for FAT
  • 4 GB for FAT32
  • 2 TB for NTFS, or maybe larger - it depends; see the Wikipedia article on NTFS for more information

I have worked with “huge” files in Win32, up to the 4 GB limit. But 2 terabyte files on NTFS? Seems we might be looking at using 64 bit addressing.
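The arithmetic behind that remark, as a small sketch: count how many bits a byte offset needs for a file of a given size. Sizes here are binary units (GiB, TiB), and the helper name is my own.

```python
def offset_bits_needed(file_size_bytes):
    """Smallest number of bits that can hold every byte offset in the file."""
    # Offsets run from 0 to size - 1, so size - 1 is the largest value to store.
    return max(1, (file_size_bytes - 1).bit_length())

GIB = 2 ** 30
TIB = 2 ** 40

for label, size in [("2 GiB", 2 * GIB), ("4 GiB", 4 * GIB), ("2 TiB", 2 * TIB)]:
    print(label, offset_bits_needed(size), "bits")
```

A signed 32 bit offset covers the 31 bit case and an unsigned one covers 32 bits; anything in terabyte territory needs a wider integer, which in practice means 64 bit file offsets.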

Which is a whole other can of worms.

Wikipedia on the desktop?

It is available for download here and here (different format).

It is “only” about 5 GB of text.

Conclusion

So, gentle reader(s), what size datasets (sparse or dense) are you working on?

On the desktop? With custom or packaged software?

As always, feedback is welcome.
