Archive for May, 2007

The e-Census – Can Web Surveys Be Any Good?

Not so long ago Australia held its latest five-yearly Census, and I completed it online (by e-Census they really mean Web Census, not, for instance, surveys by email, of which we have more to say elsewhere).

I was impressed with the professionalism. It made excellent use of JavaScript and DHTML (which meant that redundant questions were not asked, and help appeared on demand), although as far as I could figure out they were not using XMLHttpRequest technology, aka AJAX, for background transfers. It also seems they did not cater for browsers with JavaScript disabled, which is maybe OK in the context (they are the Census organization after all, and can force you to do it on paper if need be).

Technologically, full marks (reportedly at the cost of a $9 million contract with IBM). Lots to be learned from this about elegant information hiding for web surveys.
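
The "redundant questions were not asked" trick boils down to attaching a relevance rule to each question and only showing the questions whose rule passes. A minimal sketch of that idea in Python follows. It is not how the ABS form was built (that ran as JavaScript in the browser), and the question ids, wording and the age-15 cutoff are all made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Answers = Dict[str, object]
Rule = Optional[Callable[[Answers], bool]]


def is_working_age(answers: Answers) -> bool:
    """Hypothetical rule: job questions only apply from age 15 up."""
    age = answers.get("age")
    return isinstance(age, int) and age >= 15


# Hypothetical questions: (id, text, relevance rule or None for "always ask").
QUESTIONS: List[Tuple[str, str, Rule]] = [
    ("age", "How old is this person?", None),
    ("employed", "Did this person have a job last week?", is_working_age),
    ("hours", "How many hours did this person work last week?",
     lambda a: is_working_age(a) and a.get("employed") == "yes"),
]


def visible_questions(answers: Answers) -> List[str]:
    """Return the ids of the questions that should be shown for these answers."""
    return [qid for qid, _text, rule in QUESTIONS if rule is None or rule(answers)]
```

Calling `visible_questions({"age": 8})` returns only `["age"]`, so a child is never asked about working hours. The same rule table could drive both a web form and a paper-form routing instruction.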

There is also another issue. The form was re-organized, such that the web form did not look much like the paper form and required quite a different mindset.

The web form asked the questions on a per-person basis. That is, it re-presented all the (relevant) questions for each person. The paper form is a grid .. persons across the top, questions down the side. The web form is clearly easier to use, and the paper form could have been redesigned to match that flow of questioning.

OK, they did a good job. It was easy to use, I have little doubt that the e-responses will be of high quality (higher than paper, fewer logical inconsistencies..). I suspect that most e-respondents felt that it saved them time.

Having a high profile organization such as the Australian Bureau of Statistics endorse e-datacollection is surely of some comfort to commercial e-survey service providers. Maybe if people get used to doing official e-surveys, they will enjoy and do a conscientious job of answering commercial e-surveys. Maybe.

And certainly the commercial e-surveys can be sexed up .. that is, use the javascript/DHTML/information hiding techniques exemplified by the ABS e-Census.

But the Census is the Census, and it has some legislative force behind it. Unless you are particularly bloody minded, you are unlikely to be overly cavalier about answering the questions, no matter what your private opinion of the quality/relevance/fatuousness of a particular question (or of the entire study) may be.

But a web survey from some private supplier?

Do people take it seriously?

Will they start answering randomly?

Just move that mouse and click on anything?

So, let’s get some smarts about this.

Let’s assume that in the absence of face-to-face interaction (umm, interviewers?), and in the absence of the fear of legislated penalty, people are going to be pretty careless about how they answer. They are probably not going to give too much of a damn if they do the right thing, are going to get bored pretty quickly, and are gunna want to move right on through to the end.

Sure, you can do smart things to make the process more enjoyable and better tailored to the particular e-respondent and his/her particular circumstances.

But that won’t eliminate the cavalier approach, and it won’t get us away from the fact that web based surveying is likely to yield lower quality data than more expensive data collection methods.

Web surveys are here to stay (the cost advantage, the immediacy), so we had better start thinking about dealing with data that is a lot dirtier than we would like: samples that are not nearly as representative as we would like, and question responses that show clear evidence of fatigue and “could not care” effects.
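
One crude way to look for those “could not care” effects is to flag respondents who give the same answer all the way down a rating grid (often called straight-lining). The sketch below is only an illustration: the function names are hypothetical and the 0.8 threshold is an arbitrary assumption that would need tuning against real data.

```python
def longest_run(responses):
    """Length of the longest run of identical consecutive answers."""
    longest = current = 0
    previous = object()  # sentinel that equals no real answer
    for answer in responses:
        current = current + 1 if answer == previous else 1
        longest = max(longest, current)
        previous = answer
    return longest


def looks_straight_lined(responses, max_run_share=0.8):
    """Flag a respondent whose longest identical run covers most of the grid.

    max_run_share is an assumed cutoff, not an established standard.
    """
    if not responses:
        return False
    return longest_run(responses) / len(responses) >= max_run_share
```

A flag like this is a prompt to look closer, not proof of carelessness: some people really do agree with every statement.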

How much can we trust web surveys? Seems to me that is the critical question.

  • Obviously they are methodologically flawed.
  • Obviously we need to do everything we can (javascript tricks to reduce load and boredom) to make the user experience better, and quality of answering less random
  • Obviously we need to randomize the order of presentation of questions where we can
  • Obviously we need to do what we can to get a well constructed stratified random sample, albeit of web users
  • Obviously we need to use whatever data we have to help us weight the survey results to generalize to the population at large, not just the web user (or paid survey respondents) population
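
The weighting point in the list above can be made concrete with simple post-stratification: each respondent gets a weight equal to their group's share of the population divided by that group's share of the sample. This is a minimal sketch under stated assumptions. The strata names and the population shares are invented, and real weighting schemes (raking over several variables, for instance) are more involved.

```python
from collections import Counter


def poststratification_weights(sample_strata, population_shares):
    """Return one weight per stratum: population share / sample share."""
    counts = Counter(sample_strata)
    n = len(sample_strata)
    weights = {}
    for stratum, pop_share in population_shares.items():
        if counts[stratum] == 0:
            raise ValueError(f"no respondents in stratum {stratum!r}")
        weights[stratum] = pop_share / (counts[stratum] / n)
    return weights


def weighted_mean(values, strata, weights):
    """Mean of values where each respondent counts by their stratum weight."""
    total = sum(weights[s] * v for v, s in zip(values, strata))
    return total / sum(weights[s] for s in strata)


# Hypothetical web sample that over-represents the under-40s.
strata = ["under_40"] * 6 + ["40_plus"] * 4
answers = [1] * 6 + [0] * 4  # 1 = "yes", 0 = "no"
weights = poststratification_weights(strata, {"under_40": 0.5, "40_plus": 0.5})
estimate = weighted_mean(answers, strata, weights)
```

In this made-up example the raw “yes” rate is 60%, but the weighted estimate comes back to 50% because the over-sampled group is scaled down. Weighting corrects for who answered, not for how carelessly they answered.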

At the end of the day, are you going to trust this data and the analysis that flows from it?

Well, yes, to a degree.

All samples are bad.

All questions are silly/fatuous/unrealistic/semantically flawed.

All analyses suffer from unexplicated data dredging.

All reported results are colored by a desire to please.

There is no justice and no truth. Insisting on (excessive) rigour in data collection procedures diverts resources from rigour and creativity in framing the enquiry, and from time and mental energy in exploring the dataset and its implications.

Bad data, dumb questions, iffy samples : we are stuck with them.

So? So what? Just more sources of uncertainty.

Bad data is better than no data; creative analyses of awfully flawed datasets can still yield insights.

So, if we have to deal with “web surveys”, let’s make sure that they are conducted as well as possible and then get on with the job of seeing how this fuzzy and flawed data modifies our priors, what lessons and surprises and insights there may be.


BlogMining Thoughts

I have posted on somewhat similar topics before, specifically “Text Mining and Ifilters” and “Mining the Spirit of the Times – The Zeitgeist and The Buzz”, but I figure I have a little more to say specifically on the topic of blogmining.

OK, this does not have a lot to do with extracting text from difficult formats, because an RSS “feed” for a blog is usually available (meaning you can get new articles in industry-standard XML format), although access to archived material might mean you need to write an HTML downloader/parser.
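
To show how little work the feed side is, here is a minimal sketch that pulls the title, link and date out of an RSS 2.0 document using only the Python standard library. The sample feed and its example.com URLs are invented, and the function name is hypothetical. Atom feeds use different element names and XML namespaces, so they would need separate handling.

```python
import xml.etree.ElementTree as ET

# Invented sample feed, standing in for what a blog would serve.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example blog</title>
  <item>
    <title>First post</title>
    <link>http://example.com/1</link>
    <pubDate>Tue, 01 May 2007 10:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Second post</title>
    <link>http://example.com/2</link>
  </item>
</channel></rss>"""


def parse_rss_items(xml_text):
    """Extract title, link and publication date from each <item> of an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # pubDate is optional in RSS 2.0, so fall back to an empty string.
            "published": item.findtext("pubDate", default="").strip(),
        })
    return items


items = parse_rss_items(SAMPLE_FEED)
```

Fetching the feed over HTTP and walking the archives is left out on purpose; that is where the downloader/parser effort goes.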

Sampling and PostProcessing

It does have a lot to do with sampling: which blogs are you going to monitor, and how “representative” are they? And it has to do with post processing: how are you going to go about looking for what you are looking for, and do you know what you are looking for or is it just “something interesting”?

On the surface of it, mining blogs is a “good thing”. Lots of “content” there, right?

But I am not so sure that it is not more hype than reality at this point.

Just what are you attempting to find out from blog mining?

Detecting changes or novelty? .. that is a big ask. Not impossible, but a big ask.

Text classification is a difficult enough task even when you have well defined categories.. see for example all the work done on news story classification (or even spam detection for that matter, a problem that should be “easy”).

But if your objective is as inchoate as “I want to see what is happening, what is hot, what the trends are ..” then I guess you had better be prepared for a difficult road ahead, one that requires a lot of creativity and sensitivity in the analysis.

This is not a job for the engineers, the miners, the guys with the heavy machinery that can bulldoze through a million blogs and reduce them to a “bag of words” (yes, the fundamental concept in text classification).
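
For anyone who has not met the term, a “bag of words” is just a count of how often each word occurs, with word order thrown away. A minimal sketch (the tokenizing rule here, lowercase runs of letters and apostrophes, is a simplifying assumption; real pipelines add stop-word removal, stemming and so on):

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring case, punctuation and word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


bag = bag_of_words("The blog, the BLOG and the buzz.")
same = bag_of_words("dog bites man") == bag_of_words("man bites dog")
```

Here `same` is `True`: the two headlines are indistinguishable once reduced to counts, which is exactly the sort of meaning the heavy machinery flattens.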

Novelty Detection?

Novelty detection, btw, is an interesting area of statistics .. lots of applications, e.g. in image analysis. And there is a classic paper in the text classification field, “A Hierarchical Probabilistic Model for Novelty Detection in Text” (1999) by L. Douglas Baker, Thomas Hofmann, Andrew K. McCallum and Yiming Yang, that would be worth having another look at.

Opinion Mining is another area of interest, worth googling.

Well, I will let you make up your own mind how far this blog mining field has developed and how far it has to go, and whether it has gone beyond the facile word counting to something that could be useful for you .. some links FYI.

  • CRM News: Customer Service : Blog Mining Gets Real
    “Like unstructured content captured on Web forms that never really gets used, blogs’ explosive growth is generating raw data sets that your company really can’t afford to ignore. At the beginning of the year blogs were considered by many industry watchers one of the top ten trends. It’s becoming very clear that blog mining is certainly part of that mix.”
  • RelevantNoise.com » 5 Reasons Why Your Company Needs to be Concerned About Blogs
  • Al Bredenberg VoIP & CRM Blog: Mining the Blogosphere
  • Talking To Consumers Blog Mining
  • Nielsen BuzzMetrics blogpulse can be found at http://blogpulse.com/
