Scalability, Similarity, Geo-Semantics and Visualizations
I have not posted much lately because I am heavily involved in a geo-clustering project. More accurately, I am still involved in data validation, feature construction, feature visualization and feature selection for that project, all of which are interrelated (so I have several apps under development) and all of which relate to scalability.
Scalability simply means this: OK, I can get this or that approach working on a “sample”, but how is it going to work (specifically, how long is it going to take) on the full dataset of tens of thousands of cases? A few seconds per case is not going to cut it - it just won’t scale. And remember this is at the exploratory stage (construction, selection, visualization etc.), so none of this is going to be run “just once”. Apart from inevitable errors, I am rethinking approaches and discovering nuances as I go along. So, scalability rules.
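To put rough numbers on that, here is a back-of-envelope sketch. The function name and every figure in it (2 seconds per case, 50,000 cases, 20 reruns) are made up for illustration, not measurements from the project, and it assumes cost grows linearly with the number of cases.

```python
def projected_hours(seconds_per_case, n_cases, n_reruns=1):
    """Naive linear extrapolation: wall-clock hours for n_reruns full passes."""
    return seconds_per_case * n_cases * n_reruns / 3600.0


# Hypothetical figures: 2 s per case, 50,000 cases, 20 exploratory reruns.
one_pass = projected_hours(2.0, 50_000)
twenty_passes = projected_hours(2.0, 50_000, n_reruns=20)
print(f"one pass: {one_pass:.1f} h, twenty passes: {twenty_passes:.1f} h")
```

With those invented figures a single pass is already more than a day of compute, and anything pairwise (similarity matrices, for instance) grows faster than linearly, so the linear estimate is the optimistic one.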
Why not just (sub) sample?
Well, because it is circular. How do I know how to stratify my full dataset (so that I can grab an appropriately stratified, balanced or deliberately unbalanced subsample) if I don’t know the variables I should stratify on? I could, of course, make a judgement call - but that call is going, imho, to determine the nature of the outcomes.
I don’t know what I am looking for. I suspect it will be found not in the great central mass of the data but in some satellite/iconic/extreme clusters. So, I don’t want to sample too much, and scalability rules, OK?
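The worry about small satellite clusters can be made concrete with a little hypergeometric arithmetic. This is only a sketch: the function name, the idea of needing some minimum number of members before a cluster is "visible", and the example figures are my assumptions, and it covers simple random sampling without replacement only.

```python
from math import comb


def prob_cluster_missed(n_total, cluster_size, sample_size, min_hits=1):
    """Probability that a simple random sample (without replacement) contains
    fewer than min_hits members of a cluster of the given size."""
    others = n_total - cluster_size
    total = comb(n_total, sample_size)
    p = 0.0
    # Sum the hypergeometric probabilities of 0, 1, ..., min_hits - 1 hits.
    for hits in range(min(min_hits, sample_size + 1)):
        p += comb(cluster_size, hits) * comb(others, sample_size - hits) / total
    return p


# Hypothetical: 50,000 cases, a 30-case satellite cluster, a 2% sample,
# and at least 5 members needed before the cluster would be noticed.
print(prob_cluster_missed(50_000, 30, 1_000, min_hits=5))
```

The smaller the cluster and the sample, the closer that probability gets to 1, which is the argument for working on the full dataset.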
Some Interesting Other Stuff pt 1 - GeoSemantics and GeoOntologies
This comes via Similarity-Blog, which has a bunch of interesting references on
- Similarity and Case-Based Reasoning
- Similarity and Clustering
- Structural-Context Similarity
- Similarity and graphs
- Similarity and context
and similar stuff
Now, did you even know there was such a thing as GeoSemantics?
Me neither.
Maybe you are not interested, or think that your datasets don’t have any geographic reference embedded or embeddable in them. Surely there is something fundamental about geo-references, though, and maybe the reason you don’t use them is the complexity and expense and sometimes less than compelling usages of GISs (let’s face it, who wants to look at yet another colored “thematic map”?). But have a think about it. There is an edge here.
If you consider this list of topics for the upcoming GeoS 2007:
Geospatial semantics is an emerging research theme in the domain of geographic information systems and spatial databases. Geospatial semantics play an important role for next-generation spatial databases and geographic information systems, as well as specialized geospatial web services. This conference will bring together researchers whose expertise will address such issues as:
• Theories for geospatial semantic information
• Formal representations for geospatial data
• Models and languages for geoontologies
• Alignment and integration of geoontologies
• Integration of semantics into spatial query processing
• Similarity comparisons of spatial datasets
• Ontology-based spatial information retrieval
• Ontology-driven GIS
• Geospatial Semantic Web
• Multicultural aspects of spatial knowledge
you can start to get the idea.
Personally I find controlled vocabularies interesting. I have posted on that (or maybe I haven’t got to it yet), and ontologies are an extension thereof imho.
Semantic Similarity (and)
This is from similarity-blog - topics for a workshop.
I find this truly enlightening: people talking not only about the mechanics of similarity computation, but bringing into the very concept of “similarity” (and hence clustering/visualization outcomes) things like context and goals and structure.
And it is very true that “Concepts are not bags of features”. Correct. That is why we need constructive induction, and quite probably models as inputs rather than outputs.
Semantic Similarity (and)
• Time:
Concepts evolve over time and therefore also their similarity.
• Context:
As Goodman puts it, there is no meaning of similarity without defining its respects.
• Goals / Affordances:
Beside context, the goals and abilities of the user have influence on similarity.
• Structured Representation:
Concepts are not bags of features, but have a structure that influences similarity.
• Representation Extraction:
How to extract dimensions for geometric similarity approaches out of databases?
• As Compromise:
The role of similarity in decision support systems involving several users.
• Generalization:
Levels of abstractions and their influence on similarity.
• Description Logics:
How to measure similarity between DL-concepts?
• Activation/Artificial Neural Networks:
Can we use neural networks as activation & alignment structures for similarity?
Some Interesting Other Stuff pt 2 - Beautiful Visualizations
From ABeautifulWWW.com - have a look at the page “Another Visualization of the Netflix Prize Dataset” and look around the site. You will be impressed.
Todd gives the technical references on the above page, but I will reproduce them here:
The similarities were computed using the measure found in Sarwar, et al:
http://www.ra.ethz.ch/CDstore/www10/papers/pdf/p519.pdf
The ordination was done using the VxOrd algorithm (best-in-show for cluster visualizations)…
http://www.cs.ubc.ca/~tmm/courses/cpsc533c-04-spr/readings/clusterstab.pdf
The second reference is actually entitled “Cluster Stability and the Use of Noise in Interpretation of Clustering”, which sounds technical, but it is an easy read. It describes VxInsight, which has some nice features: effectively an ordination algorithm (it places things in the X,Y plane - something like a Self Organizing Map, with echoes of Simulated Annealing) with a “Terrain” overlay, the height of the terrain representing densities. Have a look, it is interesting and pretty. I wonder how it scales though?
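The “terrain” idea, height as local density of points in the X,Y plane, can be roughed out with nothing more than a 2D histogram. This is emphatically not the VxOrd or VxInsight algorithm, just a stand-in to show the concept; the function name, the grid size and the bounds are assumptions, and the ordination step that produces the points is taken as given.

```python
def density_grid(points, bins, bounds):
    """Count points per cell of a bins x bins grid over bounds.

    bounds is (xmin, xmax, ymin, ymax). The count in each cell plays the
    role of the terrain height. Points outside bounds land in an edge cell.
    """
    xmin, xmax, ymin, ymax = bounds
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        col = int((x - xmin) / (xmax - xmin) * bins)
        row = int((y - ymin) / (ymax - ymin) * bins)
        # Clamp so that points on or beyond the edges stay inside the grid.
        col = max(0, min(col, bins - 1))
        row = max(0, min(row, bins - 1))
        grid[row][col] += 1
    return grid


# Three made-up ordinated points on the unit square, 2 x 2 grid.
layout = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9)]
print(density_grid(layout, bins=2, bounds=(0.0, 1.0, 0.0, 1.0)))
```

A single pass over the points like this is linear in the number of cases, so the overlay itself should scale; the ordination that produces the X,Y positions is the part to worry about.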
Oh, and btw, take note of the correlation transform(s) he uses. These have been used by a number of people working on Netflix.
There is a bunch of other good stuff at A Beautiful WWW. Have a look at the Wikipedia map, for instance.
These visualizations do have commercial relevance, and will increasingly do so imho.
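For a feel of what an item-to-item similarity on ratings looks like, here is a minimal sketch of adjusted cosine similarity, one of the measures the Sarwar et al. paper compares (alongside plain cosine and correlation-based similarity). Which measure and which transform were actually used for the Netflix picture is not something I can confirm from here, so treat this as an illustration only. The function name, the dict-of-dicts layout and the toy ratings are my assumptions.

```python
from math import sqrt


def adjusted_cosine(ratings, item_i, item_j):
    """Adjusted cosine similarity between two items.

    ratings maps user -> {item: rating}. Each rating is centred on that
    user's own mean rating, and only users who rated both items count.
    """
    num = den_i = den_j = 0.0
    for user_ratings in ratings.values():
        if item_i in user_ratings and item_j in user_ratings:
            mean = sum(user_ratings.values()) / len(user_ratings)
            dev_i = user_ratings[item_i] - mean
            dev_j = user_ratings[item_j] - mean
            num += dev_i * dev_j
            den_i += dev_i * dev_i
            den_j += dev_j * dev_j
    if den_i == 0.0 or den_j == 0.0:
        return 0.0  # undefined case: no co-raters or no variation
    return num / (sqrt(den_i) * sqrt(den_j))


# Toy data: both users like "a" and dislike "b" relative to their own mean.
toy = {
    "u1": {"a": 5, "b": 1, "c": 3},
    "u2": {"a": 4, "b": 2, "c": 3},
}
print(adjusted_cosine(toy, "a", "b"))
```

Note the scalability angle: a full item-by-item similarity matrix needs one such computation per pair of items, so it grows quadratically with the number of items.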
John Aitchison said,
September 12, 2007 @ 5:48 pm
Some more visualizations
Maps with CSS
http://www.emanuelblagonic.com/2007/09/08/css-map-in-practice/
and
http://www.emanuelblagonic.com/2007/02/02/css-map/
http://www.smashingmagazine.com/2007/08/02/data-visualization-modern-approaches/