So far I have found three ways to feed the filter with new spam messages:

  • Set up a ham and a spam folder for every user, fetch the emails from each folder, and inject them into the filter as new training emails
  • Set up a ham and a spam email address, extract the original email from each message received there, and inject it into the filter as a new training email
  • Set up a spam trap so that every email that falls into it is injected into the filter for training

The ham/spam folder solution

I won't say much about this solution in this post. It requires that you can easily read the emails in these folders, so an IMAP server is a necessity and the emails have to be accessible on the server. sa-learn-cyrus is a good script that helps you implement this kind of training.

The ham/spam mailbox

This solution shares a lot with the previous one, but it can be used even if you do not have access to the raw emails on the server or if emails do not stay on the server (e.g. POP3 access). I will come back later in this series to a script that helps automate the analysis of emails sent by users. Basically, you have to extract the original email and give it as input to the learning program of your filter (e.g. sa-learn for spamassassin).
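
Once the original message has been extracted to a file, the training step itself is short. The helper below is only a minimal sketch: learn_message and the SA_LEARN override are hypothetical names introduced for illustration, and the only things it relies on from spamassassin are the --spam and --ham options of sa-learn.

```shell
# learn_message KIND FILE
# Feed one already-extracted message to the Bayes learner.
# KIND is "spam" (missed spam) or "ham" (false positive).
# SA_LEARN is a hypothetical override, handy for dry runs;
# it defaults to the real sa-learn command.
learn_message() {
    kind=$1
    file=$2
    case "$kind" in
        spam|ham) ;;
        *) echo "usage: learn_message spam|ham FILE" >&2; return 2 ;;
    esac
    "${SA_LEARN:-sa-learn}" "--$kind" "$file"
}
```

For example, learn_message spam /tmp/extracted.eml would train on a spam that the filter missed, while learn_message ham /tmp/extracted.eml would correct a false positive (the file path is just a placeholder).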

Problems with the first two solutions

Both solutions rely on users classifying emails, and you can never be 100% sure that they will do it. For a false positive (that is, a valid message classified as spam) you can expect to get the information back: the user will be annoyed enough that his email was wrongly classified that he will either put a copy in the dedicated folder or forward it to the appropriate mailbox.

But for spam, people tend to just delete the spam emails that were not caught (instead of forwarding them to the appropriate mailbox or moving them to the spam folder). Even if they do report them, it won't be their priority, so the training will happen later, when it is no longer so useful.

And last but not least, users still continue to receive spam, most probably "bleeding edge" spam, and so the filtering service is perceived as not very efficient, which is sad!

The spam trap

The idea behind the spam trap is simple: set up a couple of email addresses that are not used and will not receive real emails, and train the filter with whatever arrives there as new spam. One might ask which addresses should be used, and it's a good question, because you must be sure that the address will receive (enough) spam to be useful.

For this task, the spammers' bad habit of trying any combination of letters and numbers can be a great help in finding addresses to trap spam. Indeed, if your mail server is well configured, it should have a list of valid recipients, and if an email arrives for an invalid recipient it will (should?) be rejected by your server (if not, I suggest that you start with this, because it's a rather good practice).

In my case I use postfix, and when an email is rejected the log contains the message "User unknown in local recipient table", for instance (a single log line, wrapped here for readability):

postfix/smtpd[32180]: NOQUEUE: reject: RCPT from unknown[85.102.177.38]:
  550 5.1.1 <64e4cf71d8d4f61c9f34e@matws.net>:
  Recipient address rejected: User unknown in local recipient table;
  from=<udp@neath-porttalbot.gov.uk>
  to=<64e4cf71d8d4f61c9f34e@matws.net>
  proto=ESMTP helo=<dsl85-102-45350.ttnet.net.tr>

So with the following command:

grep "User unknown in local recipient table" mail.log \
 | perl -ne 'print "$1\n" if m/ to=<([^@]+@[^>]+)> /' \
 | sort | uniq -c | sort -n

I get a list like this:

     14 knfipen@matws.net
     18 2ec8fcca3ffcbb7cb960e@matws.net
     33 fixeq@matws.net
     45 lxyfxdybbh@matws.net
     48 gzxyvsf@matws.net
     65 kfiptmh@matws.net

So kfiptmh@matws.net or gzxyvsf@matws.net seem to be good targets (or at least not bad ones) for trapping spam. As we tend to receive a lot of spam once an email address has been "discovered", you can expect to receive more spam shortly after you start accepting email on a trap address, which in our case is a good thing!
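
To avoid picking the candidates by eye, the counting can be wrapped in a small helper that only prints the addresses rejected at least a given number of times. This is just a sketch: the function name trap_candidates and its threshold argument are made up for this example, and it assumes that each reject is logged on a single line, as in my mail.log.

```shell
# trap_candidates LOGFILE [MIN_HITS]
# Print "count address" for every unknown recipient that was rejected
# at least MIN_HITS times (default 10), most frequently hit address first.
trap_candidates() {
    log=$1
    min=${2:-10}
    grep 'User unknown in local recipient table' "$log" \
        | sed -n 's/.* to=<\([^>]*@[^>]*\)>.*/\1/p' \
        | sort | uniq -c | sort -rn \
        | awk -v min="$min" '$1 >= min { print $1, $2 }'
}
```

With the counts listed above, trap_candidates mail.log 48 would keep only kfiptmh@matws.net and gzxyvsf@matws.net.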

Now that we have identified potential addresses, we just have to instruct the mail server to send all the emails for those addresses to a script that will process them and train the filter. Here is a recipe for postfix, using the catch_spam script attached to this post.

Adapt parameters

At the top of the script, adjust the parameters:

  • $host, the host name or IP address of the server running the spamd daemon; set it to undef to disable this parameter (localhost will be used as the spamd server)
  • $user, the Unix account for which the spam will be learned; set it to undef to disable it (the .spamassassin directory of the user who started the script will be used)
  • $dir_base, the base directory holding new potential spam when not running in quiet mode; this directory must be writable by the user that runs the postfix service (e.g. nobody)
  • $conf, a YAML configuration file (see the example attached to this post) for filtering out some good emails (based on regexps applied to headers)
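
To give an idea of what the first two parameters control, here is how the same choice looks with the stock spamc client. This is an illustration under assumptions, not an excerpt from catch_spam: the function name is made up, and the only real pieces are the spamc options -L (learn type), -d (spamd host) and -u (user).

```shell
# spamd_learn_command HOST USER
# Print the spamc command line that would report a message read on stdin
# to spamd as spam. An empty HOST or USER leaves the matching option out,
# so spamc falls back to localhost and to the account running the command.
spamd_learn_command() {
    host=$1
    user=$2
    cmd="spamc -L spam"
    [ -n "$host" ] && cmd="$cmd -d $host"
    [ -n "$user" ] && cmd="$cmd -u $user"
    echo "$cmd"
}
```

Note that learning through spamc only works if spamd was started with the --allow-tell option.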

Create a new postfix service

We are about to create a catchspam service in postfix. This service will process all the emails caught by the trap. To do so, add the following lines to /etc/postfix/master.cf:

catchspam unix  -       n       n       -       -       pipe
  user=nobody argv=/usr/local/script/catch_spam -q

Adapt:

  • argv so that it reflects the path where you installed the catch_spam script
  • user so that it reflects the Unix account used to learn spam (the one whose home directory stores the bayes tokens ...)

Add transport maps and tune delivery

In /etc/postfix/main.cf, add the following lines:

catchspam_destination_recipient_limit = 1
transport_maps = hash:/etc/postfix/transport

The last line may already be in your /etc/postfix/main.cf; in this case there is no need to duplicate it.

Add the addresses to the valid recipient list

After settling on a list of addresses that will be used for trapping spam, you need to add them to the list of valid recipients. Just add them to /etc/aliases with nobody as the target of the alias, e.g.:

# For kfiptmh@matws.net
kfiptmh: nobody
# For gzxyvsf@matws.net
gzxyvsf: nobody

Run newaliases to make postfix aware of the change. From now on you are accepting emails for those trap addresses (gzxyvsf@matws.net and kfiptmh@matws.net in my example).

Create/Update /etc/postfix/transport

The last step is to instruct postfix to route emails for the trap addresses through the postfix service created above, instead of trying to resolve the alias. We create or update /etc/postfix/transport by adding a line following the template below for each trap address:

<trap_address> catchspam

For instance for my two trap addresses:

gzxyvsf@matws.net catchspam
kfiptmh@matws.net catchspam

Run the following command so that postfix can actually use this file:

postmap /etc/postfix/transport
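
If you end up with more than a handful of trap addresses, the transport lines can be generated from a plain list instead of being typed by hand. The following is only a sketch: transport_entries is a made-up helper name, and the one-address-per-line input file is an assumption of this example.

```shell
# transport_entries [SERVICE] < ADDRESS_LIST
# Read one trap address per line on stdin and print the matching
# transport map lines. Blank lines and lines starting with '#' are skipped.
# SERVICE defaults to catchspam, the service defined in master.cf.
transport_entries() {
    service=${1:-catchspam}
    awk -v svc="$service" '
        /^[ \t]*#/ { next }     # comment line
        NF == 0    { next }     # blank line
        { print $1, svc }
    '
}
```

For instance, transport_entries < traps.txt >> /etc/postfix/transport followed by the postmap command above would register every address listed in traps.txt (a hypothetical file name).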

And finally restart postfix so that all the modifications become active.

From now on the trap is working, and any email sent to a trap address will automatically be passed to the spamassassin Bayesian filter for spam training.
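
A quick way to check the whole chain is to send one test message to a trap address from the server itself. The helper below is a minimal sketch that only builds such a message; its name, the example.com sender and the wording of the message are all invented for this example.

```shell
# trap_test_message ADDRESS
# Print a small, obviously fake message addressed to a trap address.
trap_test_message() {
    printf 'From: trap-test@example.com\n'
    printf 'To: %s\n' "$1"
    printf 'Subject: spam trap test\n'
    printf '\n'
    printf 'This message only checks that the trap delivers to the filter.\n'
}
```

Piping it into the local sendmail command, for example trap_test_message kfiptmh@matws.net | sendmail -i kfiptmh@matws.net, should produce a postfix/pipe entry mentioning relay=catchspam in mail.log if everything is wired as described. Running sa-learn --dump magic as the learning user should then show the nspam counter going up.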