Why Bayesian Spam Filters are Doomed
Well, I am being a little provocative here, because they do work to a degree. But recently I have been finding more and more spam slipping through (I use K9 from keir.net), and there has been a bit of discussion on the web about the possible ineffectiveness of Thunderbird’s “Bayesian Spam Filtering”.
The messages that have slipped through most recently seem to be characterized by a lot of meaningful text (not junk, and free of spam keywords) and a single GIF attachment. To the filter, each one looks like a non-spam message.
This is interesting, and a good example of why I like the spam problem… there is an active antagonist out there (the spammer) who changes the rules. A while back the trick was the deliberate insertion of spaces or junk characters into words, while still leaving the text readable.
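To make that older trick concrete, here is a minimal sketch of one possible countermeasure: rejoining single letters that have been split apart by spaces or punctuation before the text reaches the filter. The function name `collapse_spaced_letters` and the choice of separator characters are assumptions made for illustration, not something taken from K9 or Thunderbird.

```python
import re

# Hypothetical pattern: three or more single letters separated by
# whitespace, dots, hyphens or asterisks, e.g. "v.i.a.g.r.a".
_SPACED_LETTERS = re.compile(r"\b(?:[A-Za-z][\s.\-*]+){2,}[A-Za-z]\b")


def collapse_spaced_letters(text: str) -> str:
    """Rejoin letters that were split apart by junk separators."""

    def join(match: re.Match) -> str:
        # Keep only the letters of the matched run.
        return re.sub(r"[^A-Za-z]", "", match.group(0))

    return _SPACED_LETTERS.sub(join, text)


cleaned = collapse_spaced_letters("buy v.i.a.g.r.a now")
```

This is deliberately crude: a genuine run of single-letter words next to an obfuscated word would be glued onto it, and a spammer can simply switch to separators the pattern does not list, which is exactly the changing-rules problem.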
I don’t know much about what Thunderbird has done with their spam filter or whether it is “Bayesian”, but there are clear problems with spam filters as they currently stand.
In particular, they seem to have inadequate memory (a message that is VERY similar to a previously identified spam message should be marked as spam with very high probability), inadequate robustness to filler junk, inadequate robustness to deliberate misspellings, and inadequate feature construction (yes, a message that has a very short subject line should be penalized if it is not pre-whitelisted).
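The "memory" point can be sketched in a few lines. The example below is only an illustration under assumptions of my own choosing: it compares messages by overlapping 5-character shingles and Jaccard similarity, and the class name `SpamMemory` and the 0.6 threshold are hypothetical. A real filter would need something that scales better than a linear scan.

```python
def shingles(text: str, k: int = 5) -> set:
    """Return the set of overlapping k-character substrings of the text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class SpamMemory:
    """Remembers confirmed spam and flags near-duplicates (hypothetical)."""

    def __init__(self, threshold: float = 0.6):
        self.threshold = threshold  # assumed cut-off, not a tuned value
        self._seen = []

    def remember(self, spam_text: str) -> None:
        self._seen.append(shingles(spam_text))

    def max_similarity(self, text: str) -> float:
        candidate = shingles(text)
        return max((jaccard(candidate, s) for s in self._seen), default=0.0)

    def looks_like_known_spam(self, text: str) -> bool:
        return self.max_similarity(text) >= self.threshold


memory = SpamMemory()
memory.remember("Cheap watches at amazing prices, click here now to order today")
variant = "Cheap watches at amazing prices, click here now to 0rder today"
unrelated = "Meeting moved to 3pm, see you in room 12"
```

Because the comparison works on character shingles rather than whole words, a single deliberate misspelling such as "0rder" only disturbs a handful of shingles, so `variant` is still flagged while `unrelated` is not.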
In short, I think that the “Bayesian” “bag of words” paradigm is inadequate, as is the lack of flexibility for the end user to design their own situation-specific filters that are not purely if-then-else based.
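To illustrate why filler text hurts a bag-of-words filter, here is a toy naive Bayes scorer with add-one smoothing. It is a sketch under stated assumptions, not how K9 or Thunderbird actually work: the class `TinyNaiveBayes`, the four training messages, and the test phrases are all invented for the example.

```python
import math
from collections import Counter


class TinyNaiveBayes:
    """Toy bag-of-words spam scorer (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label: str, text: str) -> None:
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_log_odds(self, text: str) -> float:
        """Positive means 'more likely spam', negative 'more likely ham'."""
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        totals = {c: sum(self.counts[c].values()) for c in self.counts}
        score = math.log(self.docs["spam"] / self.docs["ham"])
        for word in text.lower().split():
            if word not in vocab:
                continue  # unseen words carry no evidence in this sketch
            p_spam = (self.counts["spam"][word] + 1) / (totals["spam"] + len(vocab))
            p_ham = (self.counts["ham"][word] + 1) / (totals["ham"] + len(vocab))
            score += math.log(p_spam / p_ham)
        return score


clf = TinyNaiveBayes()
clf.train("spam", "buy cheap pills now")
clf.train("spam", "cheap pills online buy now")
clf.train("ham", "meeting notes attached for the project review")
clf.train("ham", "lunch tomorrow with the project team")

pitch = "buy cheap pills"
filler = "project meeting notes review team lunch tomorrow the"
bare_score = clf.spam_log_odds(pitch)
padded_score = clf.spam_log_odds(pitch + " " + filler)
```

With this toy training set the bare pitch scores as spam, but the same pitch padded with innocent-looking words scores as ham. Every word votes independently and word order is ignored, so enough "good" words can outvote the few that carry the actual message.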
deanabb said,
March 21, 2007 @ 11:24 am
Amazing how you and I can recognize spam in an instant, even the new and “improved” ones, while the algorithms get fooled all too often. There are a few emails I get where I’m not actually sure whether they are spam or something I somehow signed up for once, and these are really the problem for me. But then again, there are many problems like this out there that we haven’t found algorithms to crack: language translation and image classification are just two examples.
Shane said,
March 22, 2007 @ 10:23 am
Bayesian approaches may not be the best algorithm for the job, but there is also the issue of having enough training data, which is what I was alluding to in my Gmail post. Systems like SpamAssassin also seem pretty interesting.