Archive for February, 2007

“Wisdom” and Statistical Modelling

Kind of a strange topic, perhaps.

But for some reason I was re-reading “Bayesian Forecasting and Dynamic Models” by Mike West and Jeff Harrison on a quiet Sunday afternoon, and found myself struck by the evident “wisdom” in their introductory comments (some similarities perhaps with the tone and content of Frank & Witten’s “Data Mining”, but I will leave that for another day).

I don’t much like statistics as technique, or as over-explicated assumptions of convenience: I love statistics as thought and learning.

So here are some quotes from the West and Harrison book, for your interest:

A time series model is essentially a confession of ignorance, generally describing situations statistically without relating them to explanatory variables.

The key message for the modeller is “THINK, and do not sacrifice yourself to mathematical magic”.

..data analysis should not replace contextual thinking, nor should it promote mathematical formulae that have no defensible applications

All analytic methods have some contribution to offer, but they must be seen as servants of explanatory thought and not its usurper

.. modellers may, without question, accept sales statistics as their routine observations. The real objective is to forecast customer requirements. Clearly, sales statistics represent what is sold. As such they reflect the ability of the system to meet requirements, and not necessarily the actual requirements themselves.

A major problem for many mathematically based learning systems has been that of accommodating subjective information. This is particularly important at times of major change. .. The vital role of a model structure facilitating expert intervention .. Usually, at times of intervention, there is additional uncertainty with an accompanying sharp change in beliefs about the future.

factors that dominate a system at one level of aggregation may be insignificant at other levels, and vice-versa

a model should focus on a limited range of questions and not try to answer both macro and micro questions ..

Well, there is more along these lines in there.


Sensible Analysis of Censored Rank Data

This problem commonly arises in market research data, and the solutions I have seen to date just don’t seem to cut it (typically, they throw away too much data or do some averaging of ranks which I just don’t believe in).

The problem has something in common with MADM (Multi Attribute Decision Making), which is a very interesting and challenging area in its own right. Google ELECTRE if you are interested, or have a look at the paper “Ranking Projects Using the Electre Method” by John Buchanan: www.esc.auckland.ac.nz/organisations/orsnz/conf33/papers/p58.pdf

However, MADM is not directly applicable, so we need to forge ahead on our own.

The problem: I have censored rank order data. Electors have been asked to rank the 4 most important issues out of a list of 20.

For each individual we therefore have a vector of 20 measurements, each taking a value in 1..5, where 1..4 are ranks and 5 = not ranked/less important than the nominated 4.
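As a minimal sketch of that encoding (the function name, the 0-based issue indices and the example response are my own assumptions, not part of any real dataset):

```python
N_ISSUES = 20    # length of the issue list
N_RANKED = 4     # how many issues each elector ranks
NOT_RANKED = 5   # code for "not ranked / less important than the nominated 4"

def encode_response(top_issues):
    """Turn an ordered list of 4 issue indices (most important first)
    into the 20-vector of codes 1..5 described above."""
    if len(top_issues) != N_RANKED or len(set(top_issues)) != N_RANKED:
        raise ValueError("expected 4 distinct issue indices")
    if any(not 0 <= i < N_ISSUES for i in top_issues):
        raise ValueError("issue index out of range")
    vec = [NOT_RANKED] * N_ISSUES
    for rank, issue in enumerate(top_issues, start=1):
        vec[issue] = rank
    return vec

# Hypothetical elector: issue 1 first, then 7, then 0, then 19
example = encode_response([1, 7, 0, 19])
```

The point of writing it out is that 16 of the 20 entries are tied at 5, which is exactly the censoring that makes averaging ranks dubious.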

I would like to be able to use the full information in the ranking, not just rely on first preferences. Nor do I want to average ranks, which appears to be common practice.

I would like to make statements of the sort ‘I am at least 80% confident that the most important issue is “B”’.

The immediate thought is some sort of simulation/bootstrapping, which should be straightforward enough if I use just the first rank to denote “most important”… but that ignores the information contained in the lower ranks.
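A rough sketch of that first-rank-only bootstrap, to make the idea concrete. The data are made up, the function name is hypothetical, and reading the bootstrap share loosely as “confidence” is my own shortcut rather than a formal interval:

```python
import random
from collections import Counter

def first_choice_confidence(first_choices, n_boot=2000, seed=0):
    """Share of bootstrap resamples in which each issue collects
    the most first preferences (ties are split evenly)."""
    rng = random.Random(seed)
    n = len(first_choices)
    wins = Counter()
    for _ in range(n_boot):
        # Resample electors with replacement
        resample = rng.choices(first_choices, k=n)
        counts = Counter(resample)
        top = max(counts.values())
        leaders = [issue for issue, c in counts.items() if c == top]
        for issue in leaders:
            wins[issue] += 1.0 / len(leaders)
    return {issue: w / n_boot for issue, w in wins.items()}

# Made-up first preferences for 100 electors, for illustration only
toy = ["B"] * 45 + ["A"] * 35 + ["C"] * 20
conf = first_choice_confidence(toy)
```

Here `conf["B"]` is the fraction of resamples in which “B” leads on first preferences. It still uses nothing from ranks 2 to 4.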

My next thought was that I should attempt some form of ordination: unidimensional scaling, perhaps using a 1-dimensional cmdscale solution. This rests on the ability to build a suitable distance matrix, which I think is possible. Or maybe a form of Thurstone scaling.

Finally, we might build a model. A suggestion (courtesy of Jonathon Baron) is to build a model in which each issue has a mean and standard deviation for its utility. The ranks for each subject are based on the 20 numbers drawn from the model, and to fit the model we need to optimize some error function (mismatch of observed and predicted ranks). That sounds promising.
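The generative half of such a model might be sketched as follows. I am assuming normally distributed utilities (the suggestion only specifies a mean and standard deviation), and the means, standard deviations and sample size are invented. The fitting step, i.e. choosing means and standard deviations to minimise the mismatch, is not shown:

```python
import random

N_RANKED = 4
NOT_RANKED = 5

def simulate_elector(means, sds, rng):
    """Draw one utility per issue, then report ranks 1..4 for the four
    highest utilities and the code 5 for everything else."""
    utilities = [rng.gauss(m, s) for m, s in zip(means, sds)]
    order = sorted(range(len(means)), key=lambda i: utilities[i], reverse=True)
    vec = [NOT_RANKED] * len(means)
    for rank, issue in enumerate(order[:N_RANKED], start=1):
        vec[issue] = rank
    return vec

rng = random.Random(42)
means = [2.0, 1.5, 1.0] + [0.0] * 17   # hypothetical: three issues stand out
sds = [1.0] * 20                       # hypothetical: equal spread everywhere
dataset = [simulate_elector(means, sds, rng) for _ in range(500)]

# How often each issue is ranked first in the simulated data
first_pref_counts = [sum(1 for v in dataset if v[i] == 1) for i in range(20)]
```

Because the true means are known here, the same code gives an artificial dataset against which first-preference counts, rank averaging and a fitted model could be compared.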

If I get the time I will build an artificial dataset and see how the various approaches perform.

In the meantime, if your MR supplier assures you that “XYZ is the MOST IMPORTANT issue”, question them a bit more closely as to how they did the analysis (and whether it was just based on first ranks or silly averaging).

Update: there are some possible clues in the “choice modelling” literature:

http://links.jstor.org/sici?sici=0735-0015(199901)17%3A1%3C117%3AMLMIPR%3E2.0.CO%3B2-X

http://www.springerlink.com/index/GV2264QWG1476842.pdf
