Bayesian filtering is the process of using Bayesian statistical methods to classify text documents into one of several categories.

Bayesian filtering gained currency when Paul Graham described it in his 2002 essay "A Plan for Spam"[1], and it has become popular as a mechanism for distinguishing spam from desirable email. Many modern mail programs, such as Mozilla Thunderbird, implement Bayesian spam filtering.

Bayesian filters rely on the fact that particular words occur with different probabilities in different categories. For instance, most email users will seldom see the word "Viagra" in legitimate email but will encounter it frequently in spam. To "train" the filter, the user manually indicates which category each document belongs to, and the filter then uses the word counts in those labeled documents to assign each word a probability for each category.
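The training step can be illustrated with a minimal Python sketch. It is a simplified estimate under stated assumptions, not the exact formula of any particular filter: the function names `tokenize` and `train`, the whitespace tokenizer, the 0.01 to 0.99 clamp, and the four sample messages are all hypothetical choices made for illustration.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()

def train(labeled_docs):
    """Estimate a spam probability for every word seen in training.

    labeled_docs: iterable of (text, is_spam) pairs labeled by the user.
    Returns a dict mapping each word to a value between 0.01 and 0.99.
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in labeled_docs:
        words = set(tokenize(text))  # count each word once per document
        if is_spam:
            spam_counts.update(words)
            n_spam += 1
        else:
            ham_counts.update(words)
            n_ham += 1

    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        # Fraction of spam / legitimate documents that contain the word.
        spam_rate = spam_counts[word] / max(n_spam, 1)
        ham_rate = ham_counts[word] / max(n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp (an assumed choice) so no single word counts as absolute proof.
        probs[word] = min(0.99, max(0.01, p))
    return probs

# Hypothetical training data labeled by the user.
docs = [
    ("cheap viagra refinance now", True),
    ("refinance your mortgage today", True),
    ("lunch with alice today", False),
    ("alice sent the meeting notes", False),
]
word_probs = train(docs)
print(word_probs["viagra"], word_probs["alice"], word_probs["today"])
```

In this toy data set "viagra" appears only in spam and "alice" only in legitimate mail, so they land at the two ends of the clamped range, while "today" appears equally often in both and stays neutral.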

This probability indicates the likelihood that, in the absence of any other evidence, a document containing the word belongs in a particular category. For instance, most spam filters will end up assigning a very high spam probability to the words "Viagra" and "Refinance", but a very high not-spam probability to words that appear only in legitimate email, such as the names of the user's friends and family members. The filter then takes all of the evidence together to compute a final spam probability, and marks the email as spam if that probability is extremely high.
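One common way to combine the per-word evidence, sketched below, is the naive Bayes rule p = prod(p_i) / (prod(p_i) + prod(1 - p_i)), which assumes the words are independent and that spam and legitimate mail are equally likely beforehand. The function names, the neutral value of 0.5 for unknown words, the 0.9 threshold, and the sample probabilities are assumptions for illustration; real filters differ in these details.

```python
import math

def spam_probability(words, word_probs, unknown=0.5):
    """Combine per-word spam probabilities into one score for a message.

    word_probs must hold values strictly between 0 and 1.
    Words never seen in training get the neutral value `unknown`.
    """
    log_spam = 0.0
    log_ham = 0.0
    for word in set(words):
        p = word_probs.get(word, unknown)
        # Sum logs instead of multiplying to avoid floating-point underflow.
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # prod(p) / (prod(p) + prod(1 - p)) rewritten as a numerically
    # stable logistic function of the log-odds x.
    x = log_spam - log_ham
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def is_spam(words, word_probs, threshold=0.9):
    """Mark as spam only when the combined probability is very high."""
    return spam_probability(words, word_probs) > threshold

# Hypothetical per-word probabilities, as a trained filter might hold them.
example_probs = {"viagra": 0.99, "refinance": 0.99, "alice": 0.01}

print(spam_probability(["viagra", "refinance", "today"], example_probs))
print(spam_probability(["alice", "viagra"], example_probs))
print(is_spam(["lunch", "with", "alice"], example_probs))
```

Two strongly spam-leaning words push the score close to 1, while one spam-leaning word and one equally strong legitimate word cancel out at 0.5, which falls below the assumed threshold.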

The advantage of Bayesian spam filtering is that it can be trained on a user-by-user basis. The spam a user receives is often related to that user's activity, and therefore clusters statistically: placing a personal ad, for instance, may increase the likelihood of receiving personal-ad-related spam. The legitimate email a user receives also tends to cluster statistically, as many of a person's coworkers, friends, and family members discuss related subjects and therefore use similar words. Because these two sets of words differ from user to user, a filter trained on an individual's own mail can potentially offer greater accuracy than one that is not personalized.

While Bayesian filtering is most often used to identify spam, the technique can potentially be applied to classify any sort of document.
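To illustrate classification into more than two categories, here is a small multinomial naive Bayes sketch in Python. The class name `NaiveBayesClassifier`, the add-one smoothing, the whitespace tokenization, and the categories "work", "personal", and "spam" are all assumptions chosen for the example, not a description of any specific product.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesClassifier:
    """Tiny multi-category text classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()

    def train(self, text, category):
        """Record one document that the user assigned to `category`."""
        words = text.lower().split()
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable category, or None if untrained."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_category, best_score = category, score
        return best_category

# Hypothetical training documents.
clf = NaiveBayesClassifier()
clf.train("quarterly budget report attached", "work")
clf.train("budget meeting moved to friday", "work")
clf.train("family dinner on sunday", "personal")
clf.train("happy birthday from the family", "personal")
clf.train("cheap refinance offer", "spam")

print(clf.classify("budget meeting"))
print(clf.classify("family birthday"))
print(clf.classify("refinance offer"))
```

The same training and scoring logic applies whatever the categories are; only the labels supplied during training change.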

Many spam filters are available. One of the most popular is POPFile, which the user trains to differentiate between spam and legitimate mail so that it can classify messages accordingly.
