Fighting spam part 1: Spamtrap
By mat on Thursday 4 September 2008, 19:58 - Spam
Why do we need to train the filter
Bayesian filters use a statistical approach to classify emails. For this to work, you need to train the filter at the beginning with both known spam and known non-spam (ham) emails, so that the filter knows which features are statistically present in spam emails and which are not. The administrator often does this initial training (otherwise the Bayesian part is not activated in most filters), but day-to-day training is done less often and less carefully, which reduces filter efficiency as time goes by.
It is nevertheless very important that the filter stays up to date with new spam messages, so that it can pick up new spam hints and stay on top. If the filter is usually not fed continuously with new spam messages, it's because the task is not so easy.
So far I have found 3 ways to feed the filter with new spam messages:
- Set up ham and spam folders for every user, fetch the emails in each folder and inject them into the filter as new training emails
- Set up ham and spam email addresses, extract the original email from each message received there and inject it into the filter as a new training email
- Set up a spam trap so that all the emails that go in are injected into the filter for training.
The ham/spam folder solution
I won't talk much about this solution in this post, but it implies that you can easily read the emails in these folders: an IMAP server is then a necessity and emails have to be accessible. sa-learn-cyrus is a good script that helps you implement this kind of training.
The ham/spam mailbox
This solution shares a lot of aspects with the previous one, but it can be used even if you do not have access to the raw emails on the server or if emails do not stay on the server (e.g. POP3 access). I will come back later in this series to a script that helps automate the analysis of emails sent by users. Basically you have to extract the real email and give it as input to the learning program of your filter (e.g. sa-learn for SpamAssassin).
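As an illustration of the "extract the real email" step, here is a minimal Python sketch. It assumes that users forward the message as an attachment, so that the original arrives as a message/rfc822 part with its headers intact; inline forwards are not handled. The function names are hypothetical and this is not the script announced for later in the series. sa-learn reads the message from standard input when no file is given.

```python
import email
import subprocess
from email import policy


def extract_forwarded(raw: bytes):
    """Return the first attached message/rfc822 part as bytes, or None.

    Assumption: the user forwarded the original "as attachment".
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part carries exactly one embedded message.
            inner = part.get_payload(0)
            return inner.as_bytes()
    return None


def learn_spam(message: bytes) -> None:
    """Feed one message to SpamAssassin's Bayes database as spam."""
    subprocess.run(["sa-learn", "--spam"], input=message, check=True)
```

A report that contains no attached message yields None, so such mails can be set aside for manual review instead of being learned by mistake.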
Problems with the first 2 solutions
Both solutions rely on users classifying emails, and you can never be 100% sure that they will. For false positives (that is, valid messages classified as spam), you can expect to get the information back, because the user will be annoyed enough that his email was wrongly classified to either put a copy in the dedicated folder or forward it to the appropriate mailbox.
But for spam, people tend to just delete the spam emails that were not caught (instead of forwarding them to the appropriate mailbox or moving them to the spam folder), and if they do report them it won't be their priority, so the training happens later, when it is no longer so useful.
And last but not least, users still continue to receive spam, most probably "bleeding edge" spam, and so the filtering service is perceived as not so efficient, which is sad!
The spam trap
The idea behind the spam trap is simple: set up a couple of mail addresses that are not used and will not receive real emails, and train the filter with whatever arrives there as new spam. One might ask which addresses should be used, and it's a good question, because you must be sure that the address will receive (enough) spam to be useful.
For this task, the spammers' bad habit of trying any combination of letters and numbers can be a great help in finding addresses to trap spam. Indeed, if your mail server is well configured, it should have a list of valid recipients, and if an email arrives for an invalid recipient it will (should?) be rejected by your server (if not, I suggest that you start with this, because it's a rather good practice).
In my case I use postfix, and when an email is rejected I get the message "User unknown in local recipient table", for instance:
postfix/smtpd[32180]: NOQUEUE: reject: RCPT from unknown[85.102.177.38]: 550 5.1.1 <64e4cf71d8d4f61c9f34e@matws.net>: Recipient address rejected: User unknown in local recipient table; from=<udp@neath-porttalbot.gov.uk> to=<64e4cf71d8d4f61c9f34e@matws.net> proto=ESMTP helo=<dsl85-102-45350.ttnet.net.tr>
So with the following command:
grep "User unknown in local recipient table" mail.log \
  | perl -ane 'm/ to=<([^@]+@[^>]+)> /; print "$1\n";' \
  | sort | uniq -c | sort -n
I get a list like this:
14 knfipen@matws.net
18 2ec8fcca3ffcbb7cb960e@matws.net
33 fixeq@matws.net
45 lxyfxdybbh@matws.net
48 gzxyvsf@matws.net
65 kfiptmh@matws.net
So kfiptmh@matws.net or gzxyvsf@matws.net seem to be good targets (or not so bad ones) for trapping spam. As we tend to receive a lot of spam once an address is "discovered", you can expect to receive more spam shortly after you start accepting email on a trap address, which in our case is a good thing!
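The same count can be sketched in Python, which may be handy if you want to post-process the candidates (for instance keep only addresses above a threshold). This is a rough equivalent of the grep/perl pipeline, not a replacement for it; the function name is hypothetical and the regular expression assumes the postfix log format shown in the example line.

```python
import re
from collections import Counter

REJECT_MARKER = "User unknown in local recipient table"
# Same idea as the perl one-liner: capture the address inside " to=<...> ".
TO_RE = re.compile(r" to=<([^@>\s]+@[^>\s]+)> ")


def count_rejected_recipients(lines):
    """Return (address, count) pairs sorted by ascending count."""
    counts = Counter()
    for line in lines:
        if REJECT_MARKER not in line:
            continue
        match = TO_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

It can be fed an open log file directly, for example `count_rejected_recipients(open("mail.log", errors="replace"))`; the most promising trap candidates are at the end of the returned list.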
Now that we have identified potential addresses, you just have to instruct
the mail server to send all the emails for those addresses to a script that will
process them and train the filter. Here is a recipe for postfix, using the
catch_spam script attached to this post.
Adapt parameters
At the top of the script, adjust the parameters:
- $host, host or IP address of the server running the spamd daemon; set it to undef to deactivate this parameter (localhost will be used as the spamd server)
- $user, Unix account for which the spam will be learned; set it to undef to deactivate it (the .spamassassin directory of the user who started the script will be used)
- $dir_base, base directory holding new potential spam when not running in quiet mode; this directory must be writable by the user that will run the postfix service (e.g. nobody)
- $conf, a YAML configuration file (see the example attached to this post) for filtering out some good emails (based on regexps applied to headers)
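To make the role of these parameters more concrete, here is a minimal Python sketch of what a trap script of this kind might do at its core. It is not the attached catch_spam script: it uses the spamc command-line client as a stand-in, with its -L (learn type), -d (spamd host) and -u (user) options, and it assumes that spamd was started with learning allowed (--allow-tell). The function names and the plain dict of header patterns, standing in for the YAML file, are hypothetical.

```python
import email
import re
import subprocess


def build_spamc_command(host=None, user=None):
    """Build a spamc call that asks spamd to learn a message as spam."""
    cmd = ["spamc", "-L", "spam"]
    if host is not None:      # None plays the role of undef: use the default host
        cmd += ["-d", host]
    if user is not None:      # None plays the role of undef: use the calling user
        cmd += ["-u", user]
    return cmd


def looks_legitimate(raw: bytes, header_patterns: dict) -> bool:
    """True if any header matches its regexp, i.e. the mail should not be learned."""
    msg = email.message_from_bytes(raw)
    for header, pattern in header_patterns.items():
        for value in msg.get_all(header, []):
            if re.search(pattern, str(value)):
                return True
    return False


def learn_from_trap(raw: bytes, header_patterns: dict, host=None, user=None):
    """Learn one trapped message as spam unless it looks legitimate."""
    if looks_legitimate(raw, header_patterns):
        return None
    result = subprocess.run(build_spamc_command(host, user), input=raw)
    return result.returncode
```

Checking the headers before learning matters because a trap address can occasionally receive legitimate mail (a bounce or a mistyped address, for example), and learning it as spam would degrade the filter.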
Create a new postfix service
We are about to create a catchspam service in postfix. This
service will process all the emails caught by the trap. To do so, add the
following lines to /etc/postfix/master.cf:
catchspam unix  -       n       n       -       -       pipe
  user=nobody argv=/usr/local/script/catch_spam -q
Adapt:
- argv, so that it reflects the path where you installed the catch_spam script
- user, so that it reflects the Unix account with which you learn spam (the one in whose home directory the Bayes tokens are stored ...)
Add transport maps and tune delivery
In the /etc/postfix/main.cf add the following lines:
catchspam_destination_recipient_limit = 1
transport_maps = hash:/etc/postfix/transport
The last line may already be in your /etc/postfix/main.cf; in
this case there is no need to duplicate it.
Add the addresses to the valid recipient list
After settling on a list of addresses that will be used for trapping spam, you
need to add them to the list of valid recipients. Just add them to
/etc/aliases with nobody as the alias target, e.g.:
# For kfiptmh@matws.net
kfiptmh: nobody
# For gzxyvsf@matws.net
gzxyvsf: nobody
Run newaliases to make postfix aware of the change. From now on you are accepting emails for those trap addresses (gzxyvsf@matws.net and kfiptmh@matws.net in my example).
Create/Update /etc/postfix/transport
The last step is to instruct postfix to forward emails for the trap addresses
via the postfix service created above, instead of trying to resolve the alias.
We create or update /etc/postfix/transport by adding one line
following the template below for each trap address:
<trap_address> catchspam
For instance for my two trap addresses:
gzxyvsf@matws.net catchspam
kfiptmh@matws.net catchspam
Run the following command so that postfix can really use this file:
postmap /etc/postfix/transport
And finally restart postfix so that all the modifications become active.
From now on the trap is working, and any email sent to a trap address will automatically be sent to the SpamAssassin Bayesian filter for spam training.
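If you manage more than a couple of trap addresses, the alias and transport entries described above can be generated rather than typed by hand. The following Python sketch only builds the lines; the function names are hypothetical, and the defaults (nobody as alias target, catchspam as service name) simply mirror the values used in this post.

```python
def alias_lines(trap_addresses, target="nobody"):
    """Build /etc/aliases entries, one comment and one alias per trap address."""
    lines = []
    for address in trap_addresses:
        local_part = address.split("@", 1)[0]
        lines.append(f"# For {address}")
        lines.append(f"{local_part}: {target}")
    return lines


def transport_lines(trap_addresses, service="catchspam"):
    """Build /etc/postfix/transport entries routing each trap to the service."""
    return [f"{address} {service}" for address in trap_addresses]


traps = ["kfiptmh@matws.net", "gzxyvsf@matws.net"]
print("\n".join(alias_lines(traps)))
print("\n".join(transport_lines(traps)))
```

The sketch prints the entries instead of writing the files, so you can review them before appending them to /etc/aliases and /etc/postfix/transport and then running newaliases and postmap as described above.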