spamassassin-users mailing list archives

From David Jones <djo...@ena.com>
Subject Re: Train SA with e-mails 100% proven spams and next time it should be marked as spam
Date Wed, 14 Feb 2018 16:06:49 GMT
On 02/14/2018 09:20 AM, Matus UHLAR - fantomas wrote:
>> On Tue, 13 Feb 2018 21:02:46 +0000
>> Horváth Szabolcs wrote:
>>> One more question: is there a recommended ham to spam ratio? 1:1?
> 
> On 14.02.18 15:09, RW wrote:
>> No, this is a myth.  Bayes computes token probabilities from a token's
>> frequencies in spam and ham, so it all scales through. If you have
>> 2000 ham and 200 spam the problem is too few spams, not a bad ratio.
> 
> my experience says you will need more ham than spam, because you want to
> get rid of false positives (ham marked as spam) much more than of false
> negatives.
> 

This is also my experience.
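
On RW's point quoted above that token probabilities "scale through": here is a small illustration using a simplified per-token estimate. This is an assumption for illustration only, not SpamAssassin's exact Bayes formula, and the function name and counts are made up.

```python
import math


def token_spam_prob(spam_hits, n_spam, ham_hits, n_ham):
    """Simplified per-token spam probability (illustrative assumption,
    not SpamAssassin's actual formula).

    Each count is divided by its own corpus size first, so only the
    per-corpus frequencies matter, not the ham:spam ratio.
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    return spam_freq / (spam_freq + ham_freq)


# Same token frequencies (15% of spam, 1% of ham) under two corpus ratios.
p_ten_to_one = token_spam_prob(30, 200, 20, 2000)  # 2000 ham : 200 spam
p_one_to_one = token_spam_prob(30, 200, 2, 200)    # 200 ham : 200 spam

print(math.isclose(p_ten_to_one, p_one_to_one))
```

The ratio drops out of the estimate itself. What the ratio cannot fix is a corpus that is too small to give reliable frequencies, which is a separate question from how many FPs you are willing to tolerate.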


> what really matters is how many FPs/FNs you have; you can decrease their
> probability by training anything too far from BAYES_00 for ham and BAYES_99
> for spam

Correct.  You want ham hitting BAYES_00 and spam hitting BAYES_80,
BAYES_95, BAYES_99, or BAYES_999, which mine does very well.
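
One way to check where your own mail is landing is to tally the BAYES_* rule names in already-scanned ham and spam. A minimal sketch, assuming you have collected the X-Spam-Status header values as strings; the sample headers and the helper name are made up for illustration.

```python
import re
from collections import Counter

# Matches rule names such as BAYES_00, BAYES_99, BAYES_999.
BAYES_RULE = re.compile(r"\bBAYES_\d+\b")


def tally_bayes_hits(status_headers):
    """Count how many messages hit each BAYES_* rule.

    `status_headers` is an iterable of X-Spam-Status header values.
    A rule is counted at most once per message.
    """
    counts = Counter()
    for header in status_headers:
        counts.update(set(BAYES_RULE.findall(header)))
    return counts


# Hypothetical sample data, not real scan results.
ham_headers = [
    "No, score=-1.9 tests=BAYES_00,SPF_PASS",
    "No, score=0.8 tests=BAYES_50,HTML_MESSAGE",
]
spam_headers = [
    "Yes, score=9.1 tests=BAYES_99,BAYES_999,URIBL_BLACK",
    "No, score=1.2 tests=BAYES_00,SPF_PASS",
]

print("ham: ", dict(tally_bayes_hits(ham_headers)))
print("spam:", dict(tally_bayes_hits(spam_headers)))
```

Ham that is not hitting BAYES_00, and spam that is, are the messages worth looking at before deciding what to train.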

One problem I have found is that you shouldn't blindly train all spam as
spam.  Some of my spam hits BAYES_00 because, judged by its body content
alone, it truly could be ham; it is spam only because it was unsolicited
email from someone "cold" emailing for a meeting or something.

In this case, I block the sender and report the message to SpamCop and
other abuse contacts so that, hopefully, the account gets blocked, locked,
or disabled.

If I had trained my Bayes with this email as spam, then legit email with
similar content could hit BAYES_99.  That is why my nightly process to
train my Bayes DB in Redis learns ham first and spam second.  In my
experience this is the best order.
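
A rough sketch of that ordering, not my actual script: it assumes hand-sorted ham and spam sit in two directories (the paths below are hypothetical) and that sa-learn is on the PATH and already configured for your Bayes store.

```python
import subprocess


def build_training_commands(ham_dir, spam_dir):
    """Return the sa-learn invocations in the order they should run:
    ham first, then spam."""
    return [
        ["sa-learn", "--ham", ham_dir],
        ["sa-learn", "--spam", spam_dir],
    ]


def run_training(ham_dir, spam_dir):
    """Run the training commands in order, stopping on the first failure."""
    for cmd in build_training_commands(ham_dir, spam_dir):
        subprocess.run(cmd, check=True)


# Hypothetical corpus locations; adjust to wherever your sorted mail lives.
HAM_DIR = "/var/spool/bayes-train/ham"
SPAM_DIR = "/var/spool/bayes-train/spam"

# Only print the commands here; call run_training() from cron to execute.
for cmd in build_training_commands(HAM_DIR, SPAM_DIR):
    print(" ".join(cmd))
```

Keep the "cold email" type of spam out of the spam directory so it never gets learned.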

-- 
David Jones
