Message-ID: <550BDA9C.50300@thelounge.net>
Date: Fri, 20 Mar 2015 09:30:20 +0100
From: Reindl Harald <h.reindl@thelounge.net>
Organization: the lounge interactive design
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: users@spamassassin.apache.org
Subject: Re: Skipping RBL checks for internal servers
References: <A6180199D7CF3AD4966DE146@[192.168.1.9]>
	<5509E9F4.60401@thelounge.net>
	<alpine.LSU.2.03.1503181621020.22681@engineering.uiowa.edu>
	<5509F226.4020509@thelounge.net>	<20150318223435.408fbf33@gumby.homeunix.com>
	<550A02C9.6070100@thelounge.net>	<20150318235420.2c65b67f@gumby.homeunix.com>
	<550A145F.5040403@thelounge.net>	<20150319193543.267abdaf@gumby.homeunix.com>
	<550B2782.6050006@thelounge.net> <20150319225202.1baeff22@gumby.homeunix.com>
In-Reply-To: <20150319225202.1baeff22@gumby.homeunix.com>
OpenPGP: id=7F780279;
	url=https://arrakis.thelounge.net/gpg/h.reindl_thelounge.net.pub.txt
Content-Type: multipart/signed; micalg=pgp-sha1;
 protocol="application/pgp-signature";
 boundary="QbwjUha740B5jwCfB1HMm26wkIPhd4jjk"

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--QbwjUha740B5jwCfB1HMm26wkIPhd4jjk
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: quoted-printable


On 19.03.2015 at 23:52, RW wrote:
> On Thu, 19 Mar 2015 20:46:10 +0100
> Reindl Harald wrote:
>
>> On 19.03.2015 at 20:35, RW wrote:
>>> On Thu, 19 Mar 2015 01:12:15 +0100
>>> Reindl Harald wrote:
>
>>>>
>>>> the last point is easy to prove by keeping the old, unmodified
>>>> corpus and running spamc against the cleaned bayes database, and the
>>>> final result is that you stop training in circles because you need
>>>> a ton of classified ham messages to reduce the poison impact
>>>
>>> But you're testing mail that's already been trained into the
>>> database. Even though you stripped the "Bayes-poison" when
>>> training, you'll have left enough rare tokens from the headers and
>>> elsewhere to effectively "fingerprint" that spam. It's pretty much
>>> inevitable that it hits BAYES_99[9].
>>
>> you didn't get what i wrote
>
> I think I did.
>
>> * i removed poison and rebuilt bayes
>> * i verified the *original* junk, still containing poison, against
>>     the new bayes because i am not an idiot to verify cleaned samples
>>     against a bayes built of the same contents
>
> The mail you used to train was edited from the mail you used to
> test, which invalidates the result.
>
> When you train a spam you typically add a few dozen hapaxes to the
> database, and substantially alter the probabilities of many low-count
> tokens. This means that if you train and retest, the new result almost
> always matches the training.

the same happens in the other direction if somebody sends you a small,
legit mail with just a question and one of the dumb fortune footers many
people use, which sadly was part of the bayes poison

that mail would get BAYES_95 or BAYES_99 just because of the footer
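
A minimal sketch of that footer effect, under stated assumptions: the counts are made up, the token names are hypothetical, and the scoring is a plain naive-Bayes log-odds sum with Robinson-style smoothing, not SpamAssassin's actual combining code.

```python
import math
from collections import Counter


def token_prob(tok, spam, ham, nspam, nham, s=1.0, x=0.5):
    """Robinson-smoothed estimate that a mail containing `tok` is spam."""
    n = spam[tok] + ham[tok]
    if n == 0:
        return x  # unseen token: stay neutral
    sf = spam[tok] / nspam  # fraction of spam mails with this token
    hf = ham[tok] / nham    # fraction of ham mails with this token
    p = sf / (sf + hf)
    return (s * x + n * p) / (s + n)


def score(tokens, spam, ham, nspam, nham):
    """Combine token probabilities by summing log-odds (simplified)."""
    logit = 0.0
    for tok in set(tokens):
        p = token_prob(tok, spam, ham, nspam, nham)
        logit += math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up database: 100 spam and 100 ham mails already trained.
NSPAM, NHAM = 100, 100
footer = ["fortune", "twain", "wisdom", "quote"]   # hypothetical footer tokens
question = ["how", "configure", "postfix"]         # hypothetical ham tokens
spam_counts = Counter({tok: 20 for tok in footer})  # footer was trained as spam
ham_counts = Counter({tok: 5 for tok in question})  # ordinary ham vocabulary

with_footer = score(question + footer, spam_counts, ham_counts, NSPAM, NHAM)
without_footer = score(question, spam_counts, ham_counts, NSPAM, NHAM)
print(f"question + footer: {with_footer:.4f}")
print(f"question only:     {without_footer:.4f}")
```

With these toy counts the short question scores below 0.01 on its own and above 0.99 once the footer is appended, because the few ham tokens cannot outweigh footer tokens that were only ever seen in spam.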

> When you train with spam that's had its "Bayes poison" removed you
> still skew the result of a test with the full spam unless removing the
> poison removes all of the hapaxes and low-count tokens, and that's
> highly unlikely.

the point is: when you remove 70% of a message because it is poison in
the form of mark twain poems and bad jokes, and *after* that test the
unaltered message with the poem included and it gets BAYES_99 against a
corpus of 30000 samples, training works as expected

the final result is no BAYES_50 hits in the whole ham corpus, where
around 2% of the messages hit BAYES_50 before the cleanups, and that was
also "testing mail that's already been trained into the database"
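
A rough sketch of how such a before/after share could be computed, assuming you already have one Bayes probability per ham message from each run. The bucket bounds are what I believe the stock rules use, so check them against the rule file you actually run; the two score lists are invented for illustration only.

```python
import bisect

# Assumed upper bounds of the BAYES_* buckets (verify against your rules).
UPPER = [0.01, 0.05, 0.20, 0.40, 0.60, 0.80, 0.95, 0.99]
NAMES = ["BAYES_00", "BAYES_05", "BAYES_20", "BAYES_40", "BAYES_50",
         "BAYES_60", "BAYES_80", "BAYES_95", "BAYES_99"]


def bayes_rule(p):
    """Map a Bayes probability in [0, 1] to its bucket name."""
    return NAMES[bisect.bisect_right(UPPER, p)]


def rule_share(scores, rule):
    """Fraction of messages whose probability falls into `rule`."""
    if not scores:
        return 0.0
    return sum(bayes_rule(p) == rule for p in scores) / len(scores)


# Invented ham-corpus scores, only to show the calculation.
before_cleanup = [0.001] * 98 + [0.45, 0.55]
after_cleanup = [0.001] * 100

print("BAYES_50 before:", rule_share(before_cleanup, "BAYES_50"))
print("BAYES_50 after: ", rule_share(after_cleanup, "BAYES_50"))
```

Running the same function over the ham corpus before and after a cleanup gives directly comparable numbers.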

why would you want poems or cooking recipes trained as spam?


--QbwjUha740B5jwCfB1HMm26wkIPhd4jjk--