March 21, 2007 at 11:03 am
· Filed under musings
Well, I am being a little provocative here, because spam filters do work to a degree. But recently I have been finding more and more spam slipping through (I use K9 from keir.net), and there has been a bit of discussion on the web about the possible ineffectiveness of Thunderbird’s “Bayesian Spam Filtering”.
The messages that have slipped through most recently seem to be characterized by a lot of meaningful text (not junk, not spam keywords) and a single GIF attachment, so each one looks like a non-spam message.
This is interesting, and a good example of why I like the spam problem… there is an active antagonist out there (the spammer) who changes the rules. A while back it was the deliberate insertion of spaces or junk characters while still making the text readable.
I don’t know much about what Thunderbird has done with their spam filter or whether it is “Bayesian”, but there are clear problems with spam filters as they currently stand.
In particular, they seem to have inadequate memory (a message that is VERY similar to a previously identified spam message should be marked as spam with very high probability), inadequate robustness to filler junk, inadequate robustness to deliberate misspellings, and inadequate feature construction (yes, a message that has a very short subject line should be penalized if it is not pre-whitelisted).
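To make the "memory" point concrete, here is a rough Python sketch of one way a filter could remember known spam and flag close variants. It is only an illustration, not how K9 or Thunderbird work: the class name `SpamMemory`, the use of character trigrams with Jaccard similarity, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re


def shingles(text, n=3):
    """Return the set of character n-grams of a normalized message."""
    # Lowercase and keep only letters and digits, so inserted spaces or
    # punctuation ("b u y", "b.u.y") do not change the result.
    cleaned = re.sub(r"[^a-z0-9]", "", text.lower())
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class SpamMemory:
    """Hypothetical store of previously identified spam messages."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed cut-off, would need tuning
        self.known = []             # one shingle set per remembered spam

    def remember(self, text):
        self.known.append(shingles(text))

    def best_match(self, text):
        """Highest similarity between text and any remembered spam."""
        candidate = shingles(text)
        return max((jaccard(candidate, k) for k in self.known), default=0.0)

    def looks_like_known_spam(self, text):
        return self.best_match(text) >= self.threshold


memory = SpamMemory()
memory.remember("Buy cheap watches now, limited offer!")
print(memory.looks_like_known_spam("B u y chea p watch3s n0w, limited offer!!"))
print(memory.looks_like_known_spam("Minutes from Tuesday's project meeting attached"))
```

Because the comparison works on overlapping character fragments rather than whole words, a few inserted spaces or swapped characters only remove a handful of trigrams, which is the kind of robustness to mis-spellings argued for above.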
In short, I think that the “Bayesian” “bag of words” paradigm is inadequate, as is the lack of flexibility for the end user to design their own situation-specific filters that are not purely if-then-else based.
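As an illustration of what a user-designed filter that is not purely if-then-else might look like, here is a minimal Python sketch that combines a few hand-built features into a single score. Everything in it is hypothetical: the `Message` class, the feature functions, the whitelist entry, and the weights are made up for the example and have not been fitted to any real mail.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical, stripped-down view of an email."""
    sender: str
    subject: str
    body: str
    attachments: list = field(default_factory=list)  # attachment file names


WHITELIST = {"colleague@example.org"}  # assumed, user-maintained


def short_subject(msg):
    """1.0 if the subject has two words or fewer."""
    return 1.0 if len(msg.subject.split()) <= 2 else 0.0


def lone_image(msg):
    """1.0 if the only attachment is a single image file."""
    names = msg.attachments
    is_image = len(names) == 1 and names[0].lower().endswith((".gif", ".jpg", ".png"))
    return 1.0 if is_image else 0.0


def unknown_sender(msg):
    """1.0 if the sender is not on the whitelist."""
    return 0.0 if msg.sender.lower() in WHITELIST else 1.0


# (feature, weight) pairs; the weights and bias are illustrative guesses.
FEATURES = [(short_subject, 1.5), (lone_image, 2.0), (unknown_sender, 1.0)]
BIAS = -3.0


def spam_score(msg):
    """Squash the weighted feature sum into a score between 0 and 1."""
    z = BIAS + sum(weight * feature(msg) for feature, weight in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))


suspect = Message("stranger@example.com", "hi",
                  "Plenty of meaningful-looking text.", ["offer.gif"])
print(round(spam_score(suspect), 2))
```

The point of the design is that no single feature decides the outcome: each one nudges the score, and the user can add a feature or change a weight without rewriting a chain of rules.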
March 21, 2007 at 10:50 am
· Filed under musings
Although I am not totally enamored of statistical graphics or their ability to convey nuances (sometimes words rule, OK?), I am indebted to Shane’s Blog for a pointer to the R Graph Gallery, which has a whole bunch of nice graphics produced in R, together with full source code.
There is a thumbnail gallery from which you can go to more details and the source code for each of the graphics.
Here are some of the ones I like:
Violin plot - similar to a boxplot, except that it shows the density of the data, estimated by a kernel method (uses package SimpleR)
Boxplot and friends (uses packages Hmisc, hdrcde, vioplot)
Conditional Regression Tree (uses package party)
Spinogram - an extension of a histogram
Mosaic plot (uses package vcd)
Conditional density plot (uses vcd)
ScatterPlot3D
Graphical representation of a sound wave (uses seewave package)
Geographic Cluster representation/thematic maps (uses packages cluster, maps, RColorBrewer)
Other R graphics (including 3D)
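The violin plot in the list above relies on a kernel density estimate, so here is a rough sketch of what "estimated by a kernel method" means. It is written in plain Python rather than R, it is not the code the gallery uses, and the Gaussian kernel, the sample data, and the bandwidth of 0.3 are all assumptions made for the illustration.

```python
import math


def gaussian_kde(data, bandwidth):
    """Return a function estimating the density of data at any point x.

    Each observation contributes a small Gaussian bump of width
    bandwidth; the estimate is the average of those bumps.
    """
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


# Made-up sample with two clusters, around 1 and around 5.
sample = [0.9, 1.0, 1.1, 1.2, 4.9, 5.0, 5.1, 5.2]
density = gaussian_kde(sample, bandwidth=0.3)

# A violin plot draws this curve mirrored on both sides of an axis,
# so the two clusters show up as two bulges.
for x in (1.0, 3.0, 5.0):
    print(x, round(density(x), 3))
```

A boxplot of the same sample would show a single box spanning both clusters, which is the information a violin plot adds.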
GraphViz is a favorite of mine: an open source graph (network) visualization project from AT&T Research. The key apps here are dot and neato.
There is a very nice gallery here
And you can interface to it from R with Rgraphviz
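For a feel of what dot and neato consume, here is a small Python sketch that writes a graph out in GraphViz's DOT text format. The helper name `to_dot` and the example edges are made up for the illustration; only the DOT syntax itself comes from GraphViz.

```python
def quote(name):
    """Wrap a node name in double quotes, escaping embedded quotes."""
    return '"' + str(name).replace('"', '\\"') + '"'


def to_dot(edges, name="G"):
    """Render a list of (source, target) pairs as a DOT directed graph."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        lines.append(f"    {quote(src)} -> {quote(dst)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: a tiny mail-handling pipeline.
edges = [("inbox", "spam filter"), ("spam filter", "junk"), ("spam filter", "reader")]
with open("graph.dot", "w") as handle:
    handle.write(to_dot(edges) + "\n")
```

The resulting file can then be laid out with something like `dot -Tpng graph.dot -o graph.png`, or with `neato` in place of `dot` for a spring-model layout.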
Update - more ideas
The Gallery of Data Visualization presents some very creative approaches, including
- the Reorderable Matrix
- the Hanging Rootogram
- the Bagplot: a Bivariate Boxplot
- the Anamorphic Map
- the Chi-square Map
- the Multivariate Star Plot
- the Enhanced Scatterplot Matrix - enhanced with a data ellipse showing the strength of the relationship
- the CorrGram
- the Spie Chart — a comparison of two pie charts
Some good ideas there, as long as it does not turn out to be more work to interpret the graphic than it is to look at the data itself.