spamassassin-users mailing list archives

From Dave Funk <dbf...@engineering.uiowa.edu>
Subject Re: Bayes, Manual and Auto Learning Strategies
Date Wed, 02 Jul 2014 08:05:25 GMT
On Wed, 2 Jul 2014, Steve Bergman wrote:

> Well... I just turned on autolearn for a moment, deleted the bayes_* files on 
> the test account I use, and sent myself a message from my usual outside 
> account. And new bayes_* files were created. So I was wrong, and I win. More 
> options.
>
> So now I can proceed to the "what does this mean?" phase.
>
> If I leave things as they are, then training is perfect if the users are 
> diligent. But if they are not, then... what? I see plenty of spams getting 
> through with a 0.0 score. IIRC, the autolearn spam threshold is 7? Pretty 
> much everything there is spam.
>
> But I'm not sure I quite buy having the static rules of SA training Bayes. 
> Isn't Bayes just learning to emulate the static rules, with all their 
> imperfections?

Unless you've explicitly disabled them, the network-based rules (razor,
pyzor, dcc, DNS-based rules, RBLs, URIBLs, etc) constitute an external
'reputation' system that passes judgment on messages.
It's not uncommon to take a low-scoring spam and find that it gets a
higher score on retest, because it has since been added to various
bad-boy lists.

This is also one way that greylisting helps. If you stiff-arm the first
pass of a spam run, a later check may hit it more accurately because it
has been added to blocklists in the meantime.


> If it starts going wrong, doesn't that mean the errors are going to spiral 
> out of control?

That is a possible risk of relying solely on auto-learning.
The autolearn system has been carefully crafted and tuned over the years
to try to prevent a feedback loop from throwing it into a tail-spin.
For example, the internal score used to decide whether a message counts
as spam or ham for auto-learning purposes explicitly excludes the Bayes
score (and certain other kinds of scores, such as white/black lists) to
try to prevent tail-eating.
Occasional judicious manual learning can help to 'tweak' things when Bayes
looks like it's not in top shape (i.e. manually learning the false
positives and false negatives).
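
A minimal sketch of that exclusion, in Python rather than SpamAssassin's
own Perl, and not the actual implementation. The Hit class, the function
names, and the sample rule scores are hypothetical. The excluded flags
mirror the 'learn', 'userconf' and 'noautolearn' tflags, and the 0.1 and
12.0 thresholds are what I believe the stock defaults to be; check your
own configuration. The real autolearn code applies further conditions
that are left out here.

```python
from dataclasses import dataclass

# Flags that keep a rule hit out of the autolearn score in this sketch:
# 'learn' marks the Bayes rules, 'userconf' marks user white/black lists.
EXCLUDED_FLAGS = frozenset({"learn", "userconf", "noautolearn"})


@dataclass(frozen=True)
class Hit:
    """One rule hit on a message (hypothetical structure for illustration)."""
    name: str
    score: float
    flags: frozenset = frozenset()


def autolearn_score(hits):
    """Sum the scores of hits that are allowed to influence autolearn."""
    return sum(h.score for h in hits if not (h.flags & EXCLUDED_FLAGS))


def autolearn_decision(hits, ham_threshold=0.1, spam_threshold=12.0):
    """Return 'spam', 'ham', or 'no' (do not learn) for a scanned message."""
    score = autolearn_score(hits)
    if score >= spam_threshold:
        return "spam"
    if score < ham_threshold:
        return "ham"
    return "no"


# Illustrative hits; the scores are made up.
hits = [
    Hit("BAYES_99", 3.5, frozenset({"learn"})),                # ignored
    Hit("USER_IN_BLACKLIST", 100.0, frozenset({"userconf"})),  # ignored
    Hit("URIBL_BLACK", 1.7, frozenset({"net"})),
    Hit("RAZOR2_CHECK", 0.9, frozenset({"net"})),
]
print(autolearn_decision(hits))
```

With these made-up scores the message would be tagged as spam overall,
but only the two network hits count toward autolearn, so the message
falls between the thresholds and Bayes does not learn from it. That is
the point: Bayes cannot vote itself into learning a message.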

I've used site-wide Bayes with auto-learning at a site with ~3000 users
and have had to flush & restart our Bayes database twice in 10 years.

Dave

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
