Jun 9 2008

Bayesian filters

I was trying to say something clever about Bayesian filters, but there is no way anyone could describe them better than spam filter guru Paul Graham:

“To the recipient, spam is easily recognizable. If you hired someone to read your mail and discard the spam, they would have little trouble doing it. How much do we have to do, short of AI, to automate this process?

I think we will be able to solve the problem with fairly simple algorithms. In fact, I’ve found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss less than 5 per 1000 spams, with 0 false positives.

The statistical approach is not usually the first one people try when they write spam filters. Most hackers’ first instinct is to try to write software that recognizes individual properties of spam. You look at spams and you think, the gall of these guys to try sending me mail that begins “Dear Friend” or has a subject line that’s all uppercase and ends in eight exclamation points. I can filter out that stuff with about one line of code.”

Full article: http://paulgraham.com/spam.html
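To make the "Bayesian combination of the spam probabilities of individual words" a bit more concrete, here is a minimal Python sketch. It is a simplified illustration, not Graham's actual filter: the tokenizer, the function names, the toy training messages, the 0.4 score for unseen words and the 0.01 to 0.99 clamp are all assumptions picked for the example, and his article describes further tweaks that are left out here.

```python
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, $, ' and -.
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(messages):
    # Count how often each token occurs across a corpus of messages.
    counts = Counter()
    for msg in messages:
        counts.update(tokenize(msg))
    return counts

def word_spam_probability(word, good, bad, ngood, nbad):
    # Estimate how "spammy" a single word is from its two frequencies.
    g, b = good[word], bad[word]
    if g + b == 0:
        return 0.4  # assumed neutral-ish score for words never seen before
    spam_rate = min(1.0, b / nbad)
    ham_rate = min(1.0, g / ngood)
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that no single word can force a verdict of exactly 0 or 1.
    return max(0.01, min(0.99, p))

def combined_spam_probability(text, good, bad, ngood, nbad):
    # Combine the per-word scores: prod(p) / (prod(p) + prod(1 - p)).
    prod_spam = 1.0
    prod_ham = 1.0
    for word in set(tokenize(text)):
        p = word_spam_probability(word, good, bad, ngood, nbad)
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

# Toy corpora, invented for this example.
spam_messages = ["buy cheap pills now", "cheap pills free offer"]
good_messages = ["meeting notes for monday", "lunch on monday with the team"]

bad = train(spam_messages)
good = train(good_messages)
nbad, ngood = len(spam_messages), len(good_messages)

spam_score = combined_spam_probability("cheap pills offer", good, bad, ngood, nbad)
ham_score = combined_spam_probability("monday meeting", good, bad, ngood, nbad)
print(spam_score, ham_score)
```

With this toy data the first message scores close to 1 and the second close to 0. A real filter would train on thousands of messages and would need to be tuned against its own false positive rate.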

In contrast, I came across a great comment about the problems Bayesian filters are up against:

[..] “Bayesian filters, which is one layer of defence in an anti-spam system, attempts to classify mails as “good” or “spam” based on content. It studies samples of mail received in your inbox and builds a database of words found in spam mail (“viagra, penis, enlarge etc”) and words found in good mail (normal words). Thereafter, it analyses mails and if it finds words in it that are prevalent in spam, it marks it as spam – but on the other hand, if it finds words in the mail that are normally found in good mail, it marks it as good.

By including random normal words in their spam – that make no sense in the context of the spam, but are perfectly good words otherwise, the spammers are trying to fool the bayesian filter into classifying these words as spam. Once this happens, legitimate mail with these words will get classified as spam, leading you to doubt the accuracy of the spam filter, and eventually (they hope) get you to disable it as it becomes progressively useless. That’s the theory at least.”

Peter Theobald (Mumbai, India) on http://askbobrankin.com/nonsense_words_in_spam.html
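The poisoning effect described in that comment can be shown with a small Python sketch. It assumes the filter keeps retraining on incoming spam, and it uses a deliberately crude per-word score (the share of a word's message frequency that comes from spam). The function name, the messages and the 0.4 default for unseen words are invented for the illustration.

```python
def word_spam_probability(word, ham_msgs, spam_msgs):
    # Fraction of messages in each corpus that contain the word.
    ham_rate = sum(word in m.split() for m in ham_msgs) / len(ham_msgs)
    spam_rate = sum(word in m.split() for m in spam_msgs) / len(spam_msgs)
    if ham_rate + spam_rate == 0:
        return 0.4  # assumed default for a word the filter has never seen
    return spam_rate / (spam_rate + ham_rate)

# Toy corpora, invented for this example.
ham = [
    "meeting notes for monday",
    "lunch with the team",
    "project meeting moved",
]
spam_before = ["cheap pills now", "free pills offer"]

# The same spam campaign, now padded with ordinary words.
spam_after = spam_before + [
    "cheap pills meeting monday notes",
    "free offer meeting team lunch",
]

before = word_spam_probability("meeting", ham, spam_before)
after = word_spam_probability("meeting", ham, spam_after)
print(before, after)
```

Before the padded spam arrives, "meeting" looks completely innocent to the filter. Once the padded messages have been learned as spam, the same word carries a noticeably higher spam score, so legitimate mail that mentions a meeting gets pushed toward the spam folder. Whether this works against a well-trained filter in practice is another question; as the comment says, that's the theory at least.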