I am currently working on a clustering problem which is large, but more importantly, ill-defined – as are so many problems of this type.
There is a large number of cases (thousands to tens of thousands) but the mechanics of handling that issue is not too much of a concern, even given that I use one of the more computationally intensive “model based clustering” approaches eg Multimix, Mclust.
The real issue is in my variables .. how I represent the cases. I am at liberty to do this however I Iike, limited only by my imagination, and the nature of the data is such that I could end up with thousands of variables.. they could be the simple raw measurements, ratios of the original variables, model coefficients.. .
So it is not a “variable selection” problem of the more familiar sort .. it is a variable creation problem. And it is not the “curse of dimensionality”, it is the prospect of creating that curse via our own feature creation process, which is the problem.
Or is it? Is it a problem of “too many variables”?
We can of course sidestep the issue and compute similarity between two cases (which is sufficient for many clustering algorithms) without overtly creating “variables”. Think, for example, of the similarity of two documents .. that could be computed just by a simple count of the number of words in common… we then don’t need to explicitly create “variables”.
But of course we don’t really “solve” the problem this way, just sidestep it, and in so doing create another problem (dependence on algorithms that work with pairwise distances (similarities), and the problem of the size of the resultant matrix).
These issues are canvassed quite nicely, from a statistical viewpoint, in the paper
“Mixture Model Clustering using the Multimix Program” by Lynette Hunt and Murray Jorgensen from the University of Waikato.
..but, what does it MEAN?
It is not the statistical or mechanical aspects that we are concerned with – it is the representational ones.
Statisticians commonly sidestep this : the stance is effectively “well, given the dataset consisting of these cases described by these variables, this is an optimal clustering”.
Even given the variables, the representational issues are still there : what if some variables are more relevant/meaningful/actionable than others?
I conclude with a couple of quotations
From: rbanerji@sjuphil.sju.edu (Ranan Banerji) quoted at “Online Software For Clustering”
All my life I had a problem with clustering.
Any clustering method is based on some idea of similarity, proximity etc., be they numerical, symbolic or whatever.
This similarity is determined by what the researcher considers similar.
Very often in an application area we need to think of two objects as similar when they demand similar action, or some other problem dependent criterion of similarity.
Whenever I have looked, it has seemed to me that the similarity imposed by the problem and the similarity imposed by the intuition is not the same. So the problem lies in getting a match between the two measures.
The problem of computational complexity (which seems to be the thing bothering you) comes way after that. Refining the clustering method (to somehow get around the mismatch) is what gives rise to the complexity.
I have spent my life trying to develop and improve methods for getting the correct match, i.e to solving the so-called “representation problem”. My own advice would be, concentrate on sharpening your intuition of the problem so you can prove to yourself that your measure matches the measure imposed by the problem. Once you have done that, any fast-and-easy technique of clustering will work.
and
Measuring similarity: consider invariance
Somehow in the last couple days I was talking to different people about automated ways of finding similar content. One was thinking about finding similar documents in a certain text domain (that I couldn’t reveal yet). Another was just thinking out loud about finding similar images (in a domain that… um… let’s just say makes a lot of money and is consistently an early adopter of video technologies…).
The interesting thing is that both person have some computer science background but no specialty in pattern recognition or statistics. They’ve heard of cases where finding similar documents or images have worked, so they imagine that the general problem has been solved, and there shouldn’t be any difficulty to their specific application.
The fact of the matter is much more nuanced than this. It’s quite easy to invent some heuristic measure that calculates the similarity of two documents. You can apply it to your document set, and in fact it does return documents that are similar in some way…
Now, if the heuristic turns out to measure similarity in exactly the way you’ve intended it, then life is great. You give the heuristic some fancy name to impress your boss or the media, and you’re done.
Unfortunately, all too often there’s more than one way to consider similarity, and your measure ends up being influenced by all these other ways. Take text documents for example. People tend to think of similarity as topical similarity, but documents can in fact be similar for other reasons. They may have similar length, and they may have similar style if they’re from the same author, publisher, or genre (news, blogs, academic papers, etc.). Take facial images as another example, two images can be considered similar if they’re taken under similar lighting conditions, even for completely different subjects. Humans have a natural bias to ignore those kinds of similarity, but objectively they’re as legitimate as any other ways of considering similarity.
Most of the development work will actually be in tweaking the similarity measure so that it’s invariant to all these other kinds of similarity. You may need to normalize the input (against document length or lighting condition), and you may have to consider a totally different set of input features that are more robust. Successful cases in restricted domains (e.g., finding similar articles within a news site) are certainly good places to start your heuristic, but don’t underestimate the potential amount of work needed to adapt to a related problem domain.
Conclusion
Actually, I think it is even more nuanced and complex than either of these authors suggest.
I doubt whether invariance can be reasonably achieved, and I doubt that it is simply a mechanical problem .. I think it is tightly coupled with the representational problem.
And since we have no mechanism for saying whether one cluster solution is “better” than another (except in a statistical sense, given a certain set of variables), or any real way of saying which solution is “more meaningful” (or useful, or actionable), cluster analysis remains something of an art form where analyst skill is to the fore, and the nuances are all.