spamassassin-users mailing list archives

From David Jones
Subject Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
Date Wed, 14 Feb 2018 16:06:49 GMT
On 02/14/2018 09:20 AM, Matus UHLAR - fantomas wrote:
>> On Tue, 13 Feb 2018 21:02:46 +0000
>> Horváth Szabolcs wrote:
>>> One more question: is there a recommended ham to spam ratio? 1:1?
> On 14.02.18 15:09, RW wrote:
>> No, this is a myth.  Bayes computes token probabilities from a token's
>> frequencies in spam and ham, so it all scales through. If you have
>> 2000 ham and 200 spam the problem is too few spams, not a bad ratio.
> my experience says you will need more ham than spam, because you want to
> get rid of false positives (ham marked as spam) much more than of false
> negatives.

This is also my experience.
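
To illustrate RW's point quoted above that the ham:spam ratio "scales
through", here is a minimal Python sketch.  It assumes a simplified
per-token estimate based on relative frequencies; it is not
SpamAssassin's actual Bayes code, and the function name and counts are
made up for illustration.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token estimate (an assumption, not SpamAssassin's code).

    Uses the fraction of spam and ham messages containing the token,
    so the absolute corpus sizes cancel out.
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Token seen in 50% of spam and 5% of ham, with a 1:1 corpus ...
balanced = token_spam_probability(100, 10, n_spam=200, n_ham=200)
# ... and the same relative frequencies with ten times more ham (10:1).
skewed = token_spam_probability(100, 100, n_spam=200, n_ham=2000)
```

Both calls give the same estimate, because only the relative frequencies
enter the formula.  What a small spam corpus does cost you is that each
spam-side count rests on very few messages.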

> what really matters is how many FPs/FNs you have; you can decrease their
> probability by training on anything that scores too far from BAYES_00 for
> ham and BAYES_99 for spam

Correct.  You want ham hitting BAYES_00 and spam hitting BAYES_80,
BAYES_95, BAYES_99, or BAYES_999, which mine does very well.

A problem I have found is that you shouldn't blindly train all spam as
spam.  I have some spam hitting BAYES_00 because, judged on the body
contents alone, it truly could be ham; it's spam only because it was
unsolicited email from someone "cold" emailing for a meeting or something.

In this case, I block the sender and report it to SpamCop and other
abuse contacts so the account can hopefully be blocked/locked/disabled.

If I had trained my Bayes with this email as spam, then legit email with
similar content could hit BAYES_99.  That is why my nightly process to
train my Bayes DB in redis learns ham first, then spam second.  This seems
to be the best order from my experience.
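
A rough sketch of what such a nightly job could look like, in Python.
This is not my actual script: the function names and directory paths are
hypothetical, and it assumes the Bayes store is already configured and
that the spam folder has been curated by hand, so the "cold" emails
mentioned above are not in it.  Only the `sa-learn --ham` and
`sa-learn --spam` options are real.

```python
import subprocess


def build_training_commands(ham_dir, spam_dir):
    """Return the sa-learn invocations in the order they should run."""
    return [
        ["sa-learn", "--ham", ham_dir],    # learn ham first
        ["sa-learn", "--spam", spam_dir],  # then the curated spam
    ]


def run_nightly_training(ham_dir, spam_dir, runner=subprocess.run):
    """Run each command in order; check=True stops on the first failure."""
    for argv in build_training_commands(ham_dir, spam_dir):
        runner(argv, check=True)
```

From cron you would call something like
run_nightly_training("/path/to/ham", "/path/to/spam"), with the paths
replaced by wherever your sorted mail lives.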

David Jones
