Why don’t I write more about “statistics”?

I am occasionally asked something along these lines...

After I get over my initial surprise (because it seems to me that is just what I DO write about, in an applied sort of way), I get to thinking that maybe the questioner means “why don’t I write more about statistical theory?” or “why don’t I write more about analysis techniques, whether the latest or the basic ones?”.

Sort of an “If you are a police dog, where’s your badge?” type of question... well, maybe not, but it was a good opportunity to use a Thurber quote.

It’s a fair enough question, so here goes:

· My interest is in problems, not techniques. There are people who do that stuff (technique and theory development) for a living – make better analysis algorithms, refine the approaches – but I don’t like to do that in the abstract, only in response to a specific concrete problem or to my perception of what may be done about a class of problems.

· So, I don’t have an abstract interest in making a better cluster analysis algorithm, only in building such a thing, if needed, for something that comes up or for something that has come up a lot. Cluster analysis as it is routinely applied to market survey research data can certainly be improved, and that intrigues me (because of the context, the special considerations). Adding yet another generic algorithm to the arsenal of such techniques does not.

· There is, imho, already too much theory and too many algorithms out there. More than any one person can deal with, so we have specialists in an already specialized field. Theory, algorithms and analytic procedures abound. Data abounds too; we are swamped in data. Matching them up is the issue.

· It is arguable that a Pareto principle is at work (the 80/20 idea): we can get a long way with relatively accessible (as in understandable, defensible, explainable) procedures.

· I don’t like to see data going begging because clients have an impression that some rocket science is required. There are many, many problems that need to be solved and are not even addressed. Routine automated forecasting is one such. Not many companies do this, yet if we leave aside all the henny-penny arguments about rare events and all the arcana about the precise Box-Jenkins model formulation, we can do very well with some simple approaches. Better that than that nothing be done at all.

· There is plenty of evidence that simple, solid approaches work well most of the time. Take a look at the success of the machine learning rules 1R and T2 (which are about as simple as you can get), and at the very solid performance of known and accessible procedures in the M3 forecasting competition.

· That is not to say that complex situations do not call for more sophisticated approaches: they do. But I don’t want to cast my lot with the promoters of the latest and greatest, so I don’t want to talk about the “success” of the latest “breakthrough” technique.

· And there are at least two issues with leading edge techniques: a) the extent to which the results, achieved on a narrow domain, can be generalized and reproduced (why should we believe that results about medical research, such as “Most published research false?” (http://www.jr2.ox.ac.uk/bandolier/band139/b139-2.html), do not apply to statistics and data mining?) and b) implementability.

· I will take that last point and expand on it a bit. If I am going to promote, say, “Support Vector Machines” (SVMs) to my clients, then I had better be rock solid certain that I can implement the approach in code and that the results will be robust over time and circumstances (e.g. different measurement vectors, missing data etc). The SVM concept is elegant, absolutely; go look it up on Wikipedia and be enthralled... but can I say that I am convinced that it should form part of an ongoing classification system? Not yet. I have written SVM code, but rule discovery is not the problem so much as rule implementation, and here I have some worries about SVMs.

· ...and further. I have on my desk the very recently published “Bayesian Statistics and Marketing” by Peter Rossi, Greg Allenby and Rob McCulloch. I have read it. I have not digested it. There is a lot of heavy computation involved in the approaches discussed, and some of it goes towards an interest of mine: the building of individual choice models and the subsequent use of those in segmentation.
Am I convinced? Should I tout this as a breakthrough? Well, not yet is the answer. It could, indeed, be years before I say YES YES GO FOR IT! My sense is that there is something there, but I will have to do a lot more work before I start directing the attention of clients towards those possibilities.
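
An aside on the earlier point about simple rules: 1R really is about as simple as it gets. For each attribute, predict the most common class for each of its values, count the mistakes, and keep the one attribute that makes the fewest. Here is a minimal sketch in Python, assuming categorical attributes only (the full 1R procedure also discretizes numeric attributes, which is skipped here). The function names and the toy customer records are made up purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data, invented for illustration only.
ROWS = [
    {"region": "north", "segment": "retail", "churned": "yes"},
    {"region": "north", "segment": "trade", "churned": "no"},
    {"region": "south", "segment": "retail", "churned": "yes"},
    {"region": "south", "segment": "trade", "churned": "no"},
    {"region": "north", "segment": "retail", "churned": "yes"},
    {"region": "south", "segment": "trade", "churned": "yes"},
]


def one_r(rows, target):
    """Pick the single attribute whose value -> majority-class rule
    makes the fewest errors on the training rows.

    Returns (attribute, rule, errors), where rule maps each attribute
    value to its predicted class. Categorical attributes only.
    """
    attributes = [key for key in rows[0] if key != target]
    best = None
    for attr in attributes:
        # Class counts for every value this attribute takes.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the rows that do not belong to the majority class.
        errors = sum(
            sum(c.values()) - c.most_common(1)[0][1] for c in counts.values()
        )
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


def predict(fitted, row, default=None):
    """Apply a fitted 1R rule; unseen values fall back to `default`."""
    attr, rule, _ = fitted
    return rule.get(row[attr], default)


if __name__ == "__main__":
    fitted = one_r(ROWS, "churned")
    print(fitted)
    print(predict(fitted, {"region": "north", "segment": "retail"}))
```

The whole "model" is one lookup table, which is exactly why it is easy to explain to a client and easy to keep running in a production system.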

I guess the bottom line is that I think that advanced theory and new algorithms can be a distraction from the main game.

The main game is applying the vast armory of techniques we now have (in R, SAS, SPSS...) through sensible procedures, to come up with implementable, computable and solid systems.

That’s Data Analytics.

And that is why I don’t talk statistics so much.
