BlogMining Thoughts
I have posted on similar topics before, specifically “Text Mining and Ifilters” and “Mining the Spirit of the Times – The Zeitgeist and The Buzz”, but I figure I have a little more to say specifically on the topic of blog mining.
OK, this does not have much to do with extracting text from difficult formats, because a blog usually publishes an RSS “feed” (meaning you can get new articles in an industry-standard XML format). Access to archived material, though, might mean you need to write an HTML downloader/parser.
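To give a feel for how little work the feed side is, here is a minimal sketch that pulls the title, link and description out of each item in an RSS 2.0 document using only the Python standard library. The sample feed, the example.com URLs and the function name parse_rss_items are made up for illustration; a real feed would be downloaded from the blog and may well use Atom or extra namespaces that this sketch does not handle.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real one would be fetched from the blog's feed URL.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Hello blog mining</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>More text to mine</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return one dict (title, link, description) per <item> element."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for post in parse_rss_items(SAMPLE_RSS):
        print(post["title"], post["link"])
```

The archived material is the harder part: that is where the HTML downloader/parser comes in, and nothing in this sketch helps with it.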
Sampling and PostProcessing
It does have a lot to do with sampling: which blogs are you going to monitor, and how “representative” are they? It also has a lot to do with post-processing: how are you going to go about looking for what you are looking for, and do you know what you are looking for, or is it just “something interesting”?
On the surface of it, mining blogs is a “good thing”. Lots of “content” there, right?
But I suspect it is more hype than reality at this point.
Just what are you attempting to find out from blog mining?
Detecting changes or novelty? That is a big ask. Not impossible, but a big ask.
Text classification is a difficult enough task even when you have well-defined categories. See, for example, all the work done on news story classification (or even spam detection for that matter, a problem that should be “easy”).
But if your objective is as inchoate as “I want to see what is happening, what is hot, what the trends are ...” then you had better be prepared for a difficult road ahead, one that requires a lot of creativity and sensitivity in the analysis.
This is not a job for the engineers, the miners, the guys with the heavy machinery that can bulldoze through a million blogs and reduce them to a “bag of words” (yes, the fundamental concept in text classification).
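For anyone who has not met the term, a “bag of words” is nothing more than a count of how often each word occurs, with word order thrown away. A minimal sketch follows; the function name bag_of_words and the very crude tokenising rule (lowercase, then runs of letters, digits and apostrophes) are my own assumptions, not a standard.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    # Assumed tokenising rule: runs of letters, digits and apostrophes.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)


if __name__ == "__main__":
    bag = bag_of_words("The blog, the Buzz, the blog!")
    print(bag.most_common())
```

Notice what is lost: the counts say nothing about who said what about whom, which is exactly why the heavy machinery on its own does not get you to “what is hot”.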
Novelty Detection?
Novelty detection, by the way, is an interesting area of statistics with lots of applications, e.g. in image analysis. There is also a classic paper in the text classification field, “A Hierarchical Probabilistic Model for Novelty Detection in Text” (1999) by L. Douglas Baker, Thomas Hofmann, Andrew K. McCallum, and Yiming Yang, that would be worth having another look at.
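To make the idea concrete, here is a deliberately naive sketch of a novelty score: one minus the highest cosine similarity between a new post and anything seen before, computed on raw word counts. This is not the hierarchical probabilistic model from the paper above, just a simple baseline, and the names vectorize, cosine and novelty_score are made up for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a word-count vector (a bag of words)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def novelty_score(new_text, seen_texts):
    """1.0 means nothing like it has been seen; near 0.0 means a near-duplicate."""
    if not seen_texts:
        return 1.0
    new_vec = vectorize(new_text)
    return 1.0 - max(cosine(new_vec, vectorize(t)) for t in seen_texts)


if __name__ == "__main__":
    seen = ["blog mining is hype", "text classification of news stories"]
    print(novelty_score("blog mining is mostly hype", seen))
    print(novelty_score("voip pricing announcement", seen))
```

A score like this flags new vocabulary, not new ideas, so a post that says something old in fresh words looks “novel”. That gap is a good part of why this is a big ask.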
Opinion Mining is another area of interest, worth googling.
Well, I will let you make up your own mind how far this blog mining field has developed, how far it has to go, and whether it has gone beyond facile word counting to something that could be useful for you. Some links, FYI:
- CRM News: Customer Service : Blog Mining Gets Real
“Like unstructured content captured on Web forms that never really gets used, blogs’ explosive growth is generating raw data sets that your company really can’t afford to ignore. At the beginning of the year blogs were considered by many industry watchers one of the top ten trends. It’s becoming very clear that blog mining is certainly part of that mix.”
- RelevantNoise.com » 5 Reasons Why Your Company Needs to be Concerned About Blogs
- Al Bredenberg VoIP & CRM Blog: Mining the Blogosphere
- Talking To Consumers Blog Mining
- Nielsen BuzzMetrics blogpulse can be found at http://blogpulse.com/