Mining Terabytes on the Desktop

I have always had a soft spot for the book “Managing Gigabytes” by Witten, Moffat and Bell. If memory serves me they did indeed foresee the day when we would be working with terabytes (thousands of gigabytes), more or less routinely.

Although the book is primarily about compressing and indexing documents and data, it also goes into some fascinating areas such as perfect hashing, and points up the connection between compression and modelling and the minimum description length principle (MDL) .. the idea that the best theory for a set of data is the one that minimizes the size of the theory plus the amount of data necessary to specify the data relative to the theory (the residuals, if you like). (MDL is a lovely idea, “small is beautiful”, but then in most algorithms we also have a time-space tradeoff ).

All of this comes home to me as I work on the Netflix dataset where I have about 100 million records which need to be intensively mined., and I start to wonder about what the limits of what can be done on a single desktop or a small LAN.

High Performance by fitting the data storage/retrieval to the algorithm

In applications such as this there is usually a need to fit the datastructures and data storage to the algorithm. With one of my algorithms (not unrelated to Singular Value Decomposition), sequential retrieval of the data is called for and I can achieve about a single pass through the data in about 20 seconds .. about 5 million records per second : this is NOT keeping all the data in core, but making sure that the underlying Operating System and hard drive caching work with me and not against me.

Locality of Reference, Feasibility, and Scalability

So called “locality of reference” is what this is all about.. aka, don’t ask the read heads to move about too much. In another algorithm, I need some random starting points, but can proceed sequentially from there. Again, this works OK, but only by being very careful with the file setup. Sometimes duplication of files (in different physical orders) is better than the single hard disk.

This sort of performance can never be achieved through SQL or any mainstream database, where the data of interest – the logical sets- are not physically contiguous on the hard drive.

So, who cares? Well, the issue is one of feasibility. Is it possible to use state of the art algorithms on this problem, or do I have to change the algorithms to suit my hardware resources?. Scalability is also the issue .. sure I could put another PC on the network but that only doubles my grunt power, not changes it by an order of magnitude. And some algorithms are not easily parallelizable anyway.

You can get some biggish numbers in Churn analyses too.

A colleague recently told me that he was working with more than 50 million records per day. No prizes for guessing the industry, but you get these sorts of numbers in any environment where you have electronic monitoring or natural phenomena (eg high energy physics, video monitoring, weather.. see ‘The Terabyte Challenge‘).

The Center for Customer Relationship Management at Duke University had a “Churn Response Modelling Tournament” in 2003. Only about 100,000 records and 171 variables on each (so, dense, not sparse) – you can see the specs here.

In brief:

The predictors include three types of variables: behavioral data such as minutes of use, revenue, handset equipment; company interaction data such as customer calls into the customer service center, and customer household demographics.

Customers were selected as follows: mature customers, customers who were with the company for at least six months, were sampled during July, September, November, and December of 2001.

For each customer, predictor variables were calculated based on the previous four months.

Churn was then calculated based on whether the customer left the company during the period 31-60 days after the customer was originally sampled.

The actual percentage of customers who churn in a given month is approximately 1.8%.

However, churners were over sampled when creating the Calibration sample to create a roughly 50-50 split between churners and non-churners (the exact number is 49,562 churners and 50,438 non-churners).

Over sampling was not undertaken in creating the Current Score and Future Score validation samples. This is to provide a more realistic predictive test. The Current Score data contain a different set of customers from the Calibration data, but selected at the same point in time. The Future Score data contain a different set of customers selected at a future point in time.

It is not always a good idea to work with the whole dataset

Apart from the feasibility issues, the data set you have been given may not be the best one to work with.

You may get more insights into model performance and problem areas by stratifying the data and fitting separate models to each segment than feeding the whole humungous dataset into the maw of the black box.

Stratification is, by the way, not the same as oversampling (eg oversampling of churners as in the above example) but it can be used to construct datasets that most closely match your analytic objectives - you could, for example, create a stratum of persons or calls based on the difference between their actual calls and the predicted calls - ie stratifying by residuals.

Terabyte sized hard disks are here now

Seagate has a ¾ terabyte (750gb) SATA hard disk available in Australia for about $500.

But that does not mean that we can readily work with terabyte sized files. The limits on file size, by Windows OS are:

  • 2gb for Fat
  • 4gb for Fat32
  • 2TB for NTFS, or maybe larger - it depends, see the Wikipedia article on NTFS for more information

I have worked with “huge” files in Win32, up to the 4gb limit. But 2 Terabyte files on NTFS? Seems we might be looking a using 64 bit addressing.

Which is a whole other can of worms.

Wikipedia on the desktop?

It is available for download here and here (different format).

It is “only” about 5gb of text.

Conclusion

So, gentle reader(s), what size datasets (sparse or dense) are you working on?

On the desktop? With custom or packaged software?

As always, feedback is welcome.

4 Comments »

  1. » Managing Terabytes - on line and off line [ Data Sciences Analytics ] said,

    May 27, 2007 @ 8:13 pm

    […] I have previously written about mining terabytes of data with a desktop machine.. some of the strategies that work for me. […]

  2. » psst .. want some cheap n-grams? [ Data Sciences Analytics ] said,

    July 20, 2007 @ 12:51 pm

    […] And lots of other uses. But beware .. you will need your system set up for mining terabytes on the desktop .. the corpus comes on 6 DVD’s […]

  3. Joe Smith said,

    August 6, 2007 @ 9:35 am

    If you browse around the Netflix site you will find people putting the data into structures measured in a few hundred megabytes so it is possible (for some algorithms) to fit all of the data into the RAM on a modern desktop. Simon Funk says you should have 2 gigabytes but there are obviously people doing it in less.

  4. John Aitchison said,

    August 7, 2007 @ 7:16 pm

    thanks for the comment Joe. I did at various stages through the Netflix analysis end up keeping everything in core, but at the end of the day it seems to work quite well if you keep it on disk and load the needed bits as required, making sure that the OS caches it for you. Disk caching == Virtual Memory, more or less. And yes, I do have 2gb of memory but never need to use more than about 1gb.

RSS feed for comments on this post · TrackBack URI

Leave a Comment