spamassassin-users mailing list archives

From Reindl Harald <h.rei...@thelounge.net>
Subject Re: Skipping RBL checks for internal servers
Date Fri, 20 Mar 2015 08:30:20 GMT


On 19.03.2015 at 23:52, RW wrote:
> On Thu, 19 Mar 2015 20:46:10 +0100
> Reindl Harald wrote:
>
>> On 19.03.2015 at 20:35, RW wrote:
>>> On Thu, 19 Mar 2015 01:12:15 +0100
>>> Reindl Harald wrote:
>
>>>>
>>>> the last point is easy to prove by having the old, unmodified
>>>> corpus and run spamc against the cleaned bayes database and the
>>>> final result is that you stop training in circles because you need
>>>> a ton of classified ham messages to reduce the poison impact
>>>
>>> But you're testing mail that's already been trained into the
>>> database. Even though you stripped the "Bayes-poison" when
>>> training, you'll have left enough rare tokens from the headers and
>>> elsewhere to effectively "fingerprint" that spam. It's pretty much
>>> inevitable that it hits BAYES_99[9].
>>
>> you didn't get what i wrote
>
> I think I did.
>
>> * i removed poison and rebuilt bayes
>> * i verified the *original* junk still containing poison against
>>     the new bayes because i am not an idiot to verify cleaned samples
>>     against a bayes built of the same contents
>
> The mail you used to train was edited from the mail you used to
> test, which invalidates the result.
>
> When you train a spam you typically add a few dozen hapaxes to the
> database, and substantially alter the probabilities of many low-count
> tokens. This means that if you train and retest, the new result almost
> always matches the training.

the same happens in the other direction if somebody sends you a small,
legit mail with just a question and one of the dumb fortune-footers many
people use, which was sadly part of the bayes-poison

that mail would get BAYES_95 or BAYES_99 just because of the footer

> When you train with spam that's had its "Bayes poison" removed you
> still skew the result of a test with the full spam unless removing the
> poison removes all of the hapaxes and low-count tokens, and that's
> highly unlikely.

the point is: when you remove 70% of a message because it is poison in
the form of mark twain poems and bad jokes, and *after* that test the
un-altered message with the poem included, and it gets BAYES_99 on a
corpus with 30000 samples, then training works as expected

the final result is no BAYES_50 hits in the whole ham-corpus, which were
around 2% before the cleanups, and that was also "testing mail that's
already been trained into the database"

why would you want poems or cooking recipes trained as spam?
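
A toy sketch of the two setups discussed above, for illustration only. It is not SpamAssassin's Bayes implementation: the `ToyBayes` class, the add-one smoothing, the log-odds combining and all sample texts are assumptions made up for this example. It trains once on a spam with its filler text left in and once on the same spam with the filler stripped, then scores a short legit mail that happens to carry the same filler as a footer, plus the un-altered spam against the cleaned database.

```python
import math
from collections import Counter


class ToyBayes:
    """Toy per-token classifier. NOT SpamAssassin's real Bayes code."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.nspam = self.nham = 0

    def train(self, text, is_spam):
        tokens = set(text.lower().split())  # count each token once per mail
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def score(self, text):
        """Return a spam probability in (0, 1) from summed log-odds."""
        log_odds = 0.0
        for tok in set(text.lower().split()):
            if tok not in self.spam and tok not in self.ham:
                continue  # never-seen tokens carry no evidence
            p_spam = (self.spam[tok] + 1) / (self.nspam + 2)  # add-one smoothing
            p_ham = (self.ham[tok] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# Hypothetical sample texts (assumptions, not taken from any real corpus).
SPAM_BODY = "cheap pills order now xq7zt-tracking-id"
POISON = "the reports of my death are greatly exaggerated"  # stand-in filler
HAM = "quick question about the meeting tomorrow"

full_spam = SPAM_BODY + " " + POISON
ham_with_footer = HAM + " " + POISON


def build(spam_text):
    """Train one spam and one ham message into a fresh toy database."""
    db = ToyBayes()
    db.train(spam_text, is_spam=True)
    db.train(HAM, is_spam=False)
    return db


poisoned = build(full_spam)  # filler trained as spam
cleaned = build(SPAM_BODY)   # filler stripped before training

print("ham with footer, poisoned db:", round(poisoned.score(ham_with_footer), 3))
print("ham with footer, cleaned db: ", round(cleaned.score(ham_with_footer), 3))
print("un-altered spam, cleaned db: ", round(cleaned.score(full_spam), 3))
```

In this toy setup the legit mail with the footer leans spammy only when the filler was trained as spam, while the un-altered spam still scores high against the cleaned database because its remaining tokens are enough to identify it. With one message per class the numbers say nothing about a real 30000-sample corpus.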


