My not so uninteresting notes

Thursday 4 September 2008

Fighting spam part 1: Spamtrap

Why do we need to train the filter

Bayesian filters classify email statistically. For this to work, you need to train the filter at the start with both known spam and known non-spam (ham) messages, so that it learns which features are statistically typical of spam and which are not. The administrator usually does this initial training (otherwise most filters do not activate the Bayesian part at all), but day-to-day training is done rarely and often poorly, so filter efficiency degrades as time goes by.

Yet it is very important that the filter stays up to date with new spam messages, so that it can pick up new spam indicators and stay effective. If filters are rarely fed new spam on a continuous basis, it is because the task is not so easy.
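To make the point concrete, here is a minimal sketch of a word-based Bayesian classifier in Python. It is a toy for illustration only, not the algorithm SpamAssassin actually uses: the `TinyBayes` class, its method names, the whitespace tokenizer, the equal priors and the add-one smoothing are all assumptions of this example.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive Bayes spam filter (illustration only, hypothetical API)."""

    def __init__(self):
        # Per class: in how many messages each word was seen.
        self.tokens = {"spam": Counter(), "ham": Counter()}
        # Per class: how many messages were used for training.
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.tokens[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def spam_probability(self, text):
        # Sum the log-likelihood ratios of the words, with add-one
        # smoothing and equal priors for spam and ham (an assumption).
        log_odds = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.tokens["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.tokens["ham"][word] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


f = TinyBayes()
f.train("cheap pills buy now", "spam")
f.train("meeting agenda for monday", "ham")

# A new kind of spam made of words the filter has never seen.
before = f.spam_probability("exclusive casino bonus")

# Feed the filter one fresh sample, then score the same message again.
f.train("exclusive casino bonus inside", "spam")
after = f.spam_probability("exclusive casino bonus")

print(f"before: {before:.2f}, after: {after:.2f}")
```

Before the extra training the filter has no evidence either way about the new message, so the score stays at the neutral midpoint; after a single fresh spam sample it leans clearly towards spam. That is the effect continuous training is meant to preserve on a real server.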

Tuesday 2 September 2008

Fighting spam part 0: Introduction

I am about to write a few articles about not-so-bad techniques for fighting spam efficiently. Over the past years I have developed several of them, and the latest ones seem to be highly effective: a large quantity of spam caught and almost no false positives. I started developing this for my own personal domain and, because of my current job, expanded and enhanced it for the company I work for.

At the beginning it was quite simple: for my personal use I work with Thunderbird, which has long included a very good spam filter that needs little training to reach very good filtering quality, so I didn't worry much about the quality of the filtering done on the server by the spam filter.

But, alas, Thunderbird (like many other open source projects, by the way) is not corporate enough, and we are stuck with Outlook... The junk filter of the latter is rather complicated and rather useless. So if you want to reduce the users' complaints about spam, you have to find a good solution on the server.

The techniques I'll present are built around SpamAssassin and Bayesian filtering. These are not revolutionary technologies, but with fairly good (and uncomplicated) quick tuning you can achieve very good results.

It might seem illogical (and it is, a little) but I'll start this series with an article on how to automatically train an already running spam filter based on Bayesian filtering; an article on how to set it up will follow a bit later. My reason is that there are tons of guides on the Internet on how to set up Bayes in SpamAssassin, whereas articles on how to train it (without relying on the usual user feedback) are rare.

Part 1: setting a spamtrap