spamassassin-users mailing list archives

From Reindl Harald <h.rei...@thelounge.net>
Subject Re: Skipping RBL checks for internal servers
Date Thu, 19 Mar 2015 19:46:10 GMT


On 19.03.2015 at 20:35, RW wrote:
> On Thu, 19 Mar 2015 01:12:15 +0100
> Reindl Harald wrote:
>
>> On 19.03.2015 at 00:54, RW wrote:
>
>>> This is nothing to do with auto-learning. There is a difference
>>> between mis-training and training with spam that contains
>>> so-called "Bayes poison".  Bayes is best trained on what is in
>>> real-world spam, not what we would prefer that spammers put in spam
>>
>> it's the same - it is exactly the same, and it is not a matter of "what
>> we would prefer that spammers put in spam" but of what they put in
>> *in addition* to ruin bayes and filter results
>
> They don't put it there to ruin Bayes, they don't care about FP rates,
> they put it there so their spam can take advantage of what they guess
> has been trained as ham.

no, both

tests over 15000 spam samples prove it: after removing the poison, 
rebuilding bayes from the cleaned corpus and re-checking the original 
messages, all of them still hit BAYES_99

but training the poison does affect your ham, and with it your FP rates over time

> I was just looking at my recent spam and Bayes-poison seems less
> common than it used to be, but these things come in cycles.

most spam comes in cycles, which is why auto-expire is wrong

analyzing 15000 spam samples shows that otherwise *identical* messages 
sometimes contain poison and sometimes don't
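
A minimal sketch of how two copies of the same spam can be compared to isolate the appended poison. This is an illustration only, not the tooling used for the 15000 samples; `extra_tokens` and the sample strings are hypothetical, and whitespace tokenization is an assumption.

```python
from collections import Counter


def extra_tokens(plain: str, variant: str) -> list[str]:
    """Return the tokens that `variant` carries beyond those in `plain`."""
    # Multiset difference: anything the variant has more of than the plain copy.
    surplus = Counter(variant.lower().split()) - Counter(plain.lower().split())
    return sorted(surplus)


# Hypothetical pair: the same spam, once without and once with appended poison.
plain = "cheap pills order now"
variant = "cheap pills order now meeting invoice garden thursday"

print(extra_tokens(plain, variant))
```

Tokens that show up only in some copies of an otherwise identical message are candidates for stripping before training.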

>> the effect is visible:
>>
>> * there are more BAYES_00 hits than before
>> * there are fewer BAYES_50 hits for ham than before
>> * ALL of the cleaned messages still hit BAYES_99, and most hit BAYES_999
>>
>> the last point is easy to prove by keeping the old, unmodified corpus
>> and running spamc against the cleaned bayes database, and the final
>> result is that you stop training in circles, which happens because
>> you need a ton of classified ham messages to reduce the poison impact
>
>
> But you're testing mail that's already been trained into the database.
> Even though you stripped the "Bayes-poison" when training, you'll have
> left enough rare tokens from the headers and elsewhere to effectively
> "fingerprint" that spam. It's pretty much inevitable that it hits
> BAYES_99[9].

you didn't get what i wrote

* i removed the poison and rebuilt bayes
* i verified the *original* junk, still containing the poison, against
   the new bayes, because i am not such an idiot as to verify cleaned
   samples against a bayes built from the same contents
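
The two steps above can be sketched with a toy classifier. This is a simplified naive Bayes with Laplace smoothing, not SpamAssassin's actual Bayes implementation, and `ToyBayes`, the tiny corpus and the poison words are all made up for illustration.

```python
import math
from collections import Counter


class ToyBayes:
    """Toy naive Bayes over word presence (NOT SpamAssassin's algorithm)."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.nspam = self.nham = 0

    def learn(self, text, is_spam):
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def score(self, text):
        # Sum per-token log-odds (Laplace smoothed), then map to a probability.
        log_odds = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.spam[tok] + 1) / (self.nspam + 2)
            p_ham = (self.ham[tok] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# Hypothetical data: poison is a block of ham-like words appended to the spam.
POISON = "meeting invoice garden thursday"
spam_clean = ["cheap pills order now", "cheap watches order now", "win cash prize now"]
spam_original = [s + " " + POISON for s in spam_clean]
ham = ["project report attached", "lunch on friday"]
ham_test = "meeting about the invoice on thursday"

poisoned_db, cleaned_db = ToyBayes(), ToyBayes()
for msg in spam_original:
    poisoned_db.learn(msg, is_spam=True)
for msg in spam_clean:
    cleaned_db.learn(msg, is_spam=True)
for msg in ham:
    poisoned_db.learn(msg, is_spam=False)
    cleaned_db.learn(msg, is_spam=False)

# Step 2: score the ORIGINAL (still poisoned) spam against the cleaned database.
original_scores = [cleaned_db.score(msg) for msg in spam_original]
print("original spam vs cleaned db:", original_scores)
print("ham vs poisoned db:", poisoned_db.score(ham_test))
print("ham vs cleaned db:", cleaned_db.score(ham_test))
```

In this toy setup the original spam still scores as spam against the cleaned database, while ham that happens to share words with the poison scores lower there than against the database trained on the poisoned samples.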

>> if you have users from all over the world speaking different
>> languages the effect of bayes poisoning gets much more visible
>> because it contains random words in all sorts of languages and you
>> don't have enough ham to reduce that damage
>
> It sounds like you haven't learned enough. FWIW I do learn
> "Bayes-poison" and still have >99% of ham hitting BAYES_00. The figure
> has been rising over the years.

that may depend on your mailflow and some luck

>> believe it or not - my goal is to train a bayes database once and
>> have a sane system over many many years - what i read often is "spam
>> samples become outdated and so you need to restart" - no they don't,
>
> You seem to be relying on most ham hitting BAYES_00, so the rest of the
> mail can be treated very aggressively. This probably does make you less
> reliant on an up-to-date spam corpus

which is the goal: not training in circles day after day because you 
need more and more ham samples to balance out parts that never should 
have been trained as spam at all

