spamassassin-users mailing list archives

From John Hardin <jhar...@impsec.org>
Subject Re: Spamassassin Bayes... "why give that spam that score???"
Date Thu, 25 Feb 2016 01:14:21 GMT
On Thu, 25 Feb 2016, Steve wrote:

> On 24/02/2016 22:59, John Hardin wrote:
>>  On Wed, 24 Feb 2016, Steve wrote:
>> 
>> >  I've used spamassassin for many years - on Ubuntu, using amvisd - with 
>> >  great success.  In recent months, I've been receiving several spam 
>> >  messages each day that evade the filters.
>>
>>  Can you provide samples? (e.g. three or four on Pastebin)
>
> One of each of the most common forms:
>
> http://pastebin.com/Wk2KD1Q1
> http://pastebin.com/QCQ9Ymw7
> http://pastebin.com/wgkmiJLt

The second one has autolearn=yes, so I would say that autolearn is 
probably the cause of this behavior.

Note that the Bayes score doesn't contribute to the autolearn decision 
(that avoids positive feedback), but if a message shows no non-Bayes spam 
signs and scores slightly negative, as that one does, it can be learned 
as ham. That would make any subsequent similar messages score even lower, 
possibly offsetting actual spam hits.

Subsequently training those messages as spam will offset that effect, but 
you're to a degree playing whack-a-mole that way.

I misspoke a bit when I said there are no knobs to twiddle. I forgot about 
the autolearn thresholds, but they aren't strictly part of how bayes 
itself works, they are (again) training. If you want to use autolearn, you 
might want to reduce the learn-as-ham threshold even further. View 
autolearn as a not-quite-trustworthy user making submissions, and the 
thresholds as a way to limit the effects of poor judgement. :)
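For reference, those thresholds live in local.cf (stock AutoLearnThreshold 
plugin; the defaults are 0.1 for nonspam and 12.0 for spam). A sketch - 
the values here are illustrative, not recommendations:

```
# local.cf -- illustrative values only
bayes_auto_learn 1

# Only autolearn a message as ham if it scores well below
# the stock threshold of 0.1:
bayes_auto_learn_threshold_nonspam -0.5

# Stock learn-as-spam threshold; raise it to be even more
# conservative about what gets learned as spam:
bayes_auto_learn_threshold_spam 12.0
```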

> I note that they tend to come from different mail servers each time - the 
> URLs in the body tend to be unique, too.

Have you considered greylisting to give domains a chance to be added to 
URIBLs before you see them?

>> >  * The false positives all match BAYES_00 - attracting a default score of 
>> >  -1.9. BAYES_00 seems to be at the crux of the misclassification.
>> > 
>> >  Is there a way to delve into why these messages have been allocated such 
>> >  a low bayes score - while (to a human) appearing blatant, simple, spam 
>> >  on "vanilla" spam topics?  Has my bayes data been "poisoned" somehow?
>>
>>  Poisoning is less likely than mistraining.
>>  How large is your userbase and mail volume?
>
> One user - me - several email addresses.  10,000 mails per month - several 
> mailing lists where I read only a tiny fraction of the posts.

Heh. For once it's someone pretty much like me. :)

> ~ 1,500 spams (that survive mail server RBLs).  Autolearn is on - I don't 
> think about it, it is automatic. :)
>
>>  How do you train your Bayes? Autolearn? General user submissions? Trusted
>>  user submissions? Only you, from only your personal mail?
>
> Only my personal mailbox *really* matters to me.  I train from it using the 
> dovecot antispam plugin... which feeds mail I shift to/from a spam folder 
> through a pipe involving "spamc -C".

And I assume there's a similar ham folder? You need both.
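For what it's worth, a pipe like that usually boils down to one learn call 
per moved message. A sketch, assuming spamd was started with --allow-tell 
so that spamc -L works (a "spamc -C" report/revoke pipe is the other 
common variant - check which one yours actually runs):

```
# Message dragged INTO the spam folder:
spamc -L spam < message

# Message dragged back OUT to a ham folder:
spamc -L ham < message
```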

>>  Do you keep base training corpora so you can wipe and retrain if it goes
>>  off the rails for some reason?
>
> (In principle) I've got multi-gigabyte-scale spam/ham corpora.  I'm yet to 
> [ever] do anything with it. :)

I have base Bayes corpora of a few thousand messages each of spam and 
ham, kept in dated corpora files. I add a handful to those every month, 
mostly on the spam side. SA is trained nightly from the current corpora 
files, and I can retrain from scratch from all of them if needed, but I 
haven't needed to do that yet.
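In case it's useful, the nightly run is nothing fancy - a crontab sketch 
with hypothetical paths, assuming mbox-format corpora files:

```
# Retrain Bayes from the current corpora every night at 03:15
15 3 * * * sa-learn --mbox --spam $HOME/corpora/spam.mbox && sa-learn --mbox --ham $HOME/corpora/ham.mbox
```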

>>  If all the FNs are getting BAYES_00, make sure you're (re)training them as
>>  spam.
>
> I believe I'm doing that - but it isn't easy to prove that the training 
> 'worked'.

If you look at the output from the training you'll be able to see how many 
"new" messages it learned from.

It will have an effect, in that it will remove a specific mistraining, but 
in the meantime autolearn may be making bad decisions about other 
messages.

>>  Review how you're training. If your users aren't really trustworthy you
>>  should be manually reviewing submissions.
>
> When spam arrives in my primary inbox, I hand classify - I'm less obsessive 
> about mailing lists. Dovecot initiates training automatically when I shift 
> messages to a special spam folder.

OK, good. If you had a userbase, their judgement (or lack thereof) could 
be an issue.

>>  I feel autolearn can be problematic, particularly if things are already
>>  going off the rails.
>
> I expect Autolearn (assisted by Razor, Pyzor and DCC) has done the vast 
> majority of my training.  This year, I've hand-trained 216 false-negatives 
> and 0 false positives.

For an install of your size, I'd recommend turning off autolearn and 
going with purely hand-collected corpora. That serves me well.

>>  If you have base training corpora, review it for misclassifications (FNs),
>>  wipe and retrain.
>
> I guess I could do that... My expectation is that - if I train with the 
> corpora I can pick easily (without changing configuration) I'll get the same 
> bayes database I currently have... which will give the same scores.

No, autolearning would no longer be affecting the results, and if you *do* 
get the same FNs, you can then go through your ham corpora and look for 
other possible causes (misclassified messages, or a legitimate ham that 
happens to discuss spam - confusing to the classifier, and it shouldn't 
be in the corpora at all).

> Really, I'd like to understand why my current bayes database makes the 
> classifications it does.

Basically, because of what's been trained into it as ham.

If you autolearn, you can't really review that after the fact.
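To get a feel for the mechanism, here's a toy naive-Bayes-style sketch - 
NOT SA's actual chi-squared combiner, and the counts are invented - 
showing how tokens learned mostly as ham drag a message toward BAYES_00, 
and how retraining those FNs as spam shifts it back:

```python
# Toy naive-Bayes-style sketch -- not SpamAssassin's actual
# chi-squared combiner -- illustrating why tokens trained mostly
# as ham pull a message's spam probability toward 0 (BAYES_00).
from math import log, exp

def token_prob(spam_count, ham_count, s=1.0, x=0.5):
    # Robinson's smoothed estimate: belief that a token indicates
    # spam, with prior x and prior strength s.
    n = spam_count + ham_count
    p = spam_count / n if n else x
    return (s * x + n * p) / (s + n)

def combine(probs):
    # Naive Bayes combining of per-token probabilities, in log space.
    log_spam = sum(log(p) for p in probs)
    log_ham = sum(log(1.0 - p) for p in probs)
    return exp(log_spam) / (exp(log_spam) + exp(log_ham))

# Ten tokens each seen 1x in spam, 50x in ham (e.g. via autolearn):
# even a blatant spam built from them scores near 0.
ham_heavy = [token_prob(1, 50) for _ in range(10)]
print(combine(ham_heavy))   # near 0.0

# The same tokens after enough of those FNs are retrained as spam:
retrained = [token_prob(60, 50) for _ in range(10)]
print(combine(retrained))   # above 0.5
```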

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Markley's Law (variant of Godwin's Law): As an online discussion
   of gun owners' rights grows longer, the probability of an ad hominem
   attack involving penis size approaches 1.
-----------------------------------------------------------------------
  65 days since the first successful real return to launch site (SpaceX)
