We mention them all the time. We look for them as a feature in our anti-spam products. But do we know what they are, or are they just another black box in our infrastructure? For many an experienced admin, Bayesian filters may be old hat, but for others, it is a term easily used but not fully understood. This article will crack open the box for those who are curious about just what the heck a Bayesian filter actually is, what it does, and how it works.

Let’s start with a little vocabulary that is used when we discuss Bayesian filters and spam in general.

### Spam

Unsolicited Commercial Email, or messages that were neither requested nor welcomed, and generally are an attempt to sell you something.

### Ham

Email that the intended recipient wants to receive; in other words, legitimate, non-spam email.

### False positives

Legitimate email (ham) incorrectly identified as spam.

### False negatives

Spam that is classified as legitimate email and passed to the user’s inbox.

### Rev. Thomas Bayes

Bayesian filters’ namesake, Reverend Thomas Bayes, was born over two hundred years before the technology that uses his theorem was created. He was a Presbyterian minister and mathematician who lived in England in the 1700s and studied mathematics and theology at the University of Edinburgh. He wrote a mathematical treatise, published posthumously, that defended Sir Isaac Newton’s calculus, as well as a respected theological text.

However, Bayes is best known for his theorem on probability. Bayes’ theorem is also called the theorem of probability of causes.

In short, it states that if A_{1}, A_{2} … A_{n} are mutually exclusive events, one of which could have caused B, then the sample space is S = A_{1} ∪ A_{2} ∪ … ∪ A_{n}, i.e., one of these events has to occur. Bayes’ Rule gives us the probability of event A_{i} given that B has occurred, and is expressed as:

P(A_{i} | B) = P(B | A_{i}) P(A_{i}) / [ P(B | A_{1}) P(A_{1}) + … + P(B | A_{n}) P(A_{n}) ]

The probability of event A given event B (e.g. the probability that an email is spam given that it contains one or more keywords associated with spam) depends not only on the relationship between events A and B but also on the marginal probability of occurrence of each event.

Bayes’ theorem is used by Bayesian filters to calculate the probability that an email is spam based on the likelihood that any individual email is spam, the likelihood of the presence of a certain word in spam, the likelihood of the presence of that same word in ham, and other traits such as links to sites from other domains or known spam domains, etc. If that makes your head spin (and it does mine) then let’s simplify this with a practical example.

This example just uses round numbers to illustrate the point; the percentages are arbitrary. Consider an email that contains the phrase ‘bank account.’ If we take all emails collectively and say that 80% of them are spam and 20% are legitimate, and we say that the phrase ‘bank account’ appears in 20% of spam messages and 10% of legitimate messages, then the likelihood that an email containing the phrase ‘bank account’ is spam is eight times higher than that it is legitimate (16% versus 2%). This will be factored in with the probabilistic analysis of other phrases in the email, any links, the source domain, and other attributes to come up with a total probability that a specific email is spam. If the probability exceeds the threshold, the message is filtered. If it is below the threshold, it is passed on.
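The arithmetic above can be replayed in a few lines of Python. The 80/20 priors and the 20%/10% phrase likelihoods are the article’s illustrative round numbers, not real-world figures:

```python
# Illustrative round numbers from the example above (not real-world figures).
p_spam = 0.80         # prior: P(spam)
p_ham = 0.20          # prior: P(ham)
p_phrase_spam = 0.20  # likelihood: P("bank account" | spam)
p_phrase_ham = 0.10   # likelihood: P("bank account" | ham)

# Joint probability of seeing the phrase together with each class.
joint_spam = p_spam * p_phrase_spam  # 0.16
joint_ham = p_ham * p_phrase_ham     # 0.02

# Bayes' rule: posterior probability the email is spam, given the phrase.
p_spam_given_phrase = joint_spam / (joint_spam + joint_ham)

print(round(joint_spam, 2), round(joint_ham, 2))  # 0.16 0.02 -> spam is 8x more likely
print(round(p_spam_given_phrase, 3))              # 0.889
```

A real filter would combine many such token probabilities rather than stopping at one phrase, but the shape of the calculation is the same.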

Bayesian filters need to be ‘trained’ because the attributes that identify spam are not consistent across all organisations. You can imagine that the percentage of emails sent to a bank that include the phrase ‘bank account’ would be much higher than at another company.
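Training is essentially counting. Here is a minimal sketch, assuming a tiny made-up corpus of labelled messages, of how a filter might learn per-word spam likelihoods (the add-one smoothing is a common convention so unseen words don’t divide by zero):

```python
from collections import Counter

def train(messages):
    """Count how many messages each word appears in."""
    counts = Counter()
    for text in messages:
        # set(): count each word once per message, not once per occurrence
        counts.update(set(text.lower().split()))
    return counts

# Tiny made-up training corpora, just for illustration.
spam = ["verify your bank account now", "free money in your account"]
ham = ["meeting notes attached", "lunch at the bank cafe tomorrow"]

spam_counts, ham_counts = train(spam), train(ham)

def spaminess(word, smoothing=1):
    """Estimate P(spam | word) with add-one smoothing."""
    p_w_spam = (spam_counts[word] + smoothing) / (len(spam) + 2 * smoothing)
    p_w_ham = (ham_counts[word] + smoothing) / (len(ham) + 2 * smoothing)
    return p_w_spam / (p_w_spam + p_w_ham)

print(spaminess("account"))  # 0.75 -- seen only in spam, so a high score
print(spaminess("bank"))     # 0.5  -- seen equally in both, so neutral
```

Retraining on each organisation’s own mail is what moves words like ‘bank account’ toward neutral at a bank and toward spammy elsewhere.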

Spammers try to fool Bayesian filters using several techniques. You have probably seen paragraphs of seemingly random text at the end of a spam message, or words that are broken up with nonsense characters or soft-hyphens. These are ways to game the system by either escaping detection, or throwing the total calculation off by placing words or phrases that are more likely to be found in legitimate mail than in spam.

While Bayesian filtering is an important part of most anti-spam systems, it is only one part and should be used in combination with other methods like whitelists, blacklists, and other filtering technologies. Fighting spam, just like any other security initiative, should take a layered approach, often called defense in depth.

You missed mentioning one important fact about modern Bayesian anti-spam filters. Most of them don’t just break messages into individual words but use different techniques to assemble tokens from the message they are trying to classify. So simply adding some junk text to the body of a message is not enough to fool the Bayesian anti-spam system. Of course, in the beginning, when most anti-spam filters used single words (aka unigrams, or n-grams of size 1), adding some junk text to the message body could indeed fool the Bayesian filter. But modern Bayesian filters, which use n-grams of size greater than 1, are not so easy to fool. And besides that, most Bayesian anti-spam filters don’t use all the data from a message to perform the classification (as a purely naïve approach would). Most of them use algorithms (for example Graham’s, Burton’s, etc.) to consider only the most significant tokens from a message. Those algorithms significantly reduce the chance that junk text (aka ‘word salad’) inside a message body can fool the Bayesian anti-spam filter.

Another very important figure is missing from your example. To compute the probability that a message belongs to one of the two classes SPAM/HAM, you need not only the 80% spam (overall true positives -> TP) and 20% ham (overall true negatives -> TN) figures, plus the 20% spam hit rate and 10% innocent hit rate for the token ‘bank account’. You also need the figures for misclassified messages (false positives -> FP / false negatives -> FN). Without them you cannot properly compute the probability.

I once wrote the following to a mailing list for a Bayesian-based anti-spam filter, after a user asked why the software needs to keep track of TP, TN, FP and FN. Here is the message I wrote to that user:

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

You have run into the classic problem of statistical thinking. There is an example, found in a lot of psychological literature, that demonstrates the trouble most humans have with statistical thinking. The problem is known in the socio-psychological literature as the “taxi/cab problem”. Let me quickly show you the example:

————————————————

Two taxi companies operate in a city. The taxis of company A are green, those of company B blue. Company A runs 15% of the taxis, company B the remaining 85%. One night there is a hit-and-run accident. The fleeing car was a taxi. A witness states that it was a green taxi.

The court orders a test of the witness’s ability to differentiate between green and blue taxis under night-time viewing conditions. The test result: in 80% of the cases the witness identified the correct colour, and in the remaining 20% of the cases he was wrong.

How high is the probability that the fleeing taxi the witness saw that night was a (green) taxi from company A?

————————————————

Most people spontaneously answer 80%. In fact, a study has shown that a majority of those asked (among them physicians, judges and students at elite universities) answer the question with 80%.

But the correct answer is not 80%.

Allow me to explain:

The whole city has 1,000 taxis: 150 green ones belonging to company A and 850 blue ones belonging to company B. One of those 1,000 taxis is responsible for the accident. The witness says he saw a green taxi, and we know that he is correct in 80% of the cases. That also means he calls a blue taxi green in 20% of the cases. From the 850 blue taxis he will thus wrongly call 170 green (false positives), and from the 150 green taxis he will correctly identify 120 as green (true positives). To calculate the probability that he actually saw a green taxi when he identifies a taxi (under night viewing conditions) as green, you divide all correct answers (TP) of “green” by all answers (FP + TP) of “green”. Therefore the probability is: 120 / (170 + 120) = 0.41

The probability that a green taxi caused the accident, given that the witness believes he saw a green taxi, is therefore less than 50%. This probability depends crucially on the distribution of the green and blue taxis in the city. Were there equal numbers of green and blue taxis in the city, the correct answer would indeed be 80%.
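The taxi arithmetic can be checked directly; this short sketch just replays the numbers from the example above:

```python
# The numbers from the taxi/cab example above.
green, blue = 150, 850  # taxis of company A (green) and company B (blue)
accuracy = 0.80         # witness identifies the colour correctly 80% of the time

true_positives = green * accuracy        # green taxis correctly called green: 120
false_positives = blue * (1 - accuracy)  # blue taxis wrongly called green: 170

# P(taxi really green | witness says green) = TP / (TP + FP)
p_green = true_positives / (true_positives + false_positives)
print(round(p_green, 2))  # 0.41

# With equal fleets, the base rate no longer skews the answer:
tp_eq, fp_eq = 500 * accuracy, 500 * (1 - accuracy)
print(round(tp_eq / (tp_eq + fp_eq), 2))  # 0.8
```

The same TP/FP bookkeeping is exactly why a Bayesian mail filter has to track its own misclassification counts, not just overall spam and ham rates.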

Most humans, however, tend to ignore the initial distribution (also called the a priori, base, or prior probability). Psychologists refer to this as “base rate neglect”.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

Excellent practical example, Stevan, and it also points out something I missed. Thank you!

Ed

No problem, Ed. I hope you don’t mind my writing corrections to your posts?

Not in the least! I appreciate the input, and if I make a mistake, I want to be called out on it. It’s nice that you find enough value in these to read them, and that you can add value as with that example above. We’re trying to develop a community on this site; regular commenters are a vital part of that.

Cheers,

Ed

I like your articles. That’s why I read them.