Archive for July, 2007

Traps for (young) data miners

I think we can all do with reminders, from time to time, of the many traps and pitfalls in data analytics, so I thought it worth passing on a link to this article, Identifying and Overcoming Common Data Mining Mistakes, via IAPA. (If you are an Australian statistician, data miner or data analyst, then I reckon it is a good idea to join it and subscribe to the alert service; there is often an interesting meeting or two floating around, particularly in Sydney or Melbourne.)

So, back to the paper mentioned above. It is not rocket science nor even the state of the art, and yes it does (unsurprisingly) talk about SAS and how SAS does things, but I liked it because it was not doom and gloom.

Yes, there are issues. Yes, there are many things we all need to know and internalize, so that we can feel comfortable that we are doing “the right thing” while still putting our intuition to work. And it is often hard to find our way around the plethora of techniques and approaches, even in SAS.

So, a well written, sensible and encouraging article like this is of value. Never mind that the headings identify “problem areas” - it is generally upbeat.

One of my current interests is variable selection. I did not find much of technical interest on that topic here, but it did at least emphasize the role of judgement. Variables are not important just because a (naive and maybe one-off) selection procedure says they are. Conversely, variables that a procedure deems unimportant may still be important because YOU (with some logical/theoretical/gut-feel backing) say they are.

And the paper had some nice discussion of what to do about the very common problem of infrequent positive outcomes (that is, your dependent variable takes the value 0 more than 90% of the time and 1 less than 10%) and how to weave costs into that.
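To make the point about costs concrete, here is a minimal sketch in Python. It is not taken from the paper and is not how SAS does it. It assumes you already have predicted probabilities from some model, that correct decisions cost nothing, and that the two misclassification costs are known. The function names and the cost figures are hypothetical, chosen only for illustration. Under those assumptions, expected cost is minimised by flagging a case as positive whenever its probability is at least cost_fp / (cost_fp + cost_fn), which can be far below the usual 0.5 when positives are rare but expensive to miss.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability cutoff that minimises expected cost.

    Assumes correct decisions cost nothing; cost_fp and cost_fn are the
    (hypothetical) costs of a false positive and a false negative.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def classify(probabilities, cost_fp, cost_fn):
    """Label a case 1 when its predicted probability reaches the cost-based cutoff."""
    cutoff = cost_threshold(cost_fp, cost_fn)
    return [1 if p >= cutoff else 0 for p in probabilities]


def total_cost(actual, predicted, cost_fp, cost_fn):
    """Add up the misclassification cost of a set of 0/1 decisions."""
    cost = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 1:
            cost += cost_fp  # flagged a non-event
        elif a == 1 and p == 0:
            cost += cost_fn  # missed a real event
    return cost


# Illustrative numbers only: missing an event costs 9 times a false alarm.
scores = [0.05, 0.20, 0.60]
labels = classify(scores, cost_fp=1, cost_fn=9)
```

With these made-up costs the cutoff works out to 0.1, so a case scored at 0.20 is flagged even though it is more likely than not to be a 0. That is the sense in which costs, and not just accuracy, should drive the decision when positives are rare.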

It is worth a read, imho.

And here, to pique your interest, are some of the headings/identified problem areas.

  • Preparing The Data
  • Failing To Consider Enough Variables
  • Incorrectly Preparing Or Failing To Prepare Categorical Predictors
  • Too Many Overall Levels
  • Levels That Rarely Occur
  • One Level That Almost Always Occurs
  • Incorrectly Preparing Or Failing To Prepare Continuous Predictors
  • Extremely Skewed Predictors
  • A Spike And A Distribution
  • One Level That Almost Always Occurs
  • Ignoring Or Misusing Time-Dependent Information
  • Defining Roles, Performing Sampling, And Defining Target Profiles
  • Inappropriate Metadata
  • Inadequate Or Excessive Input Data
  • Inappropriate Or Missing Target Profile For Categorical Target
  • Target Variable Event Levels Occurring In Different Proportions
  • Differences In Misclassification Costs
  • Partitioning The Data
  • Misunderstanding The Roles Of The Partitioned Data Sets
  • Failing To Consider Changing The Default Partition
  • Choosing The Variables
  • Failing To Evaluate The Variables Before Selection
  • Using Only One Selection Method
  • Misunderstanding Or Ignoring Variable Selection Options
  • Replacing Missing Data
  • Failing To Evaluate Imputation Method
  • Overlooking Missing Value Indicators
  • Overusing Stepwise Regression
  • Inaccurately Interpreting The Results
  • Ignoring Tree Instability
  • Ignoring Tree Limitations
  • Failing To Do Variable Selection (Neural Networks)
  • Failing To Consider Neural Networks
  • Comparing Fitted Models
  • Misinterpreting Lift
  • Choosing The Wrong Assessment Statistic
  • Generating Inefficient Score Code
  • Ignoring The Model Performance
  • Building Only One Cluster Solution
  • Including (Many) Categorical Variables
  • Failing To Manage The Number Of Outcomes

Finding your way around R

I have been meaning to write this for a while, and have now come across a great collection of resources and guides at The Net Takeaway, and some useful links at R - Loyalty Matrix.

I strongly recommend simpleR: Using R for Introductory Statistics.

There is a wiki for collaborative documentation, though it seems to be down at the moment.

And of course there are the “task views”: overviews of which R packages might be relevant to you in a particular field.

There is not a “task view” for time-series, but this paper by Ricci is quite a good starting point.

The “Ion TikiWiki” has a lot of great R code and articles (in the forecasting and stock market area); see, for example, this article on testing for random walks. Here is the index to the articles, which is really worth browsing.

I have prepared some notes on time series packages and R which I will publish here some day. Also some on cluster analysis.

A short post; I hope to do more on the subject. R is a great system, anarchic to a degree, and it is not always easy to find your way around it.
