From: Reindl Harald
Organization: the lounge interactive design
Date: Fri, 20 Mar 2015 09:30:20 +0100
To: users@spamassassin.apache.org
Subject: Re: Skipping RBL checks for internal servers
Message-ID: <550BDA9C.50300@thelounge.net>
In-Reply-To: <20150319225202.1baeff22@gumby.homeunix.com>

On 19.03.2015 at 23:52, RW wrote:
> On Thu, 19 Mar 2015 20:46:10 +0100
> Reindl Harald wrote:
>
>> On 19.03.2015 at 20:35, RW wrote:
>>> On Thu, 19 Mar 2015 01:12:15 +0100
>>> Reindl Harald wrote:
>
>>>>
>>>> the last point is easy to prove by having the old, unmodified
>>>> corpus and running spamc against the cleaned bayes database, and the
>>>> final result is that you stop training in circles because you need
>>>> a ton of classified ham messages to reduce the poison impact
>>>
>>> But you're testing mail that's already been trained into the
>>> database. Even though you stripped the "Bayes-poison" when
>>> training, you'll have left enough rare tokens from the headers and
>>> elsewhere to effectively "fingerprint" that spam. It's pretty much
>>> inevitable that it hits BAYES_99[9].
>>
>> you didn't get what i wrote
>
> I think I did.
>
>> * i removed poison and rebuilt bayes
>> * i verified the *original* junk still containing poison against
>>   the new bayes because i am not an idiot to verify cleaned samples
>>   against a bayes built of the same contents
>
> The mail you used to train was edited from the mail you used to
> test, which invalidates the result.
>
> When you train a spam you typically add a few dozen hapaxes to the
> database, and substantially alter the probabilities of many low-count
> tokens. This means that if you train and retest, the new result almost
> always matches the training.

the same happens in the other direction: if somebody sends you a small,
legit mail with just a question and one of the dumb fortune footers many
people use, and that footer was sadly part of the bayes poison, that
mail would get BAYES_95 or BAYES_99 just because of the footer

> When you train with spam that's had its "Bayes poison" removed you
> still skew the result of a test with the full spam unless removing the
> poison removes all of the hapaxes and low-count tokens, and that's
> highly unlikely.

the point is: when you remove 70% of a message because it is poison in
the form of Mark Twain poems and similar bad jokes, and *after* that
test the unaltered message with the poem included, and it gets BAYES_99
on a corpus with 30000 samples, then training works as expected

the final result is no BAYES_50 hits in the whole ham corpus, where
there were around 2% before the cleanups, and that was also "testing
mail that's already been trained into the database"

why would you want poems or cooking recipes trained as spam?
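
Both arguments in this exchange can be illustrated with a toy model. The sketch below is not SpamAssassin's Bayes implementation: it scores a message as the plain average of per-token spam probabilities, and every message, token, and function name in it is made up for illustration. Under those assumptions it shows (a) that the full spam still scores high after training on the cleaned copy, because the non-poison tokens alone identify it, and (b) that a short legitimate mail carrying the same footer text scores as spam only when the poison was trained.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many training messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.split()))
    return counts

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Toy per-token spam probability; unseen tokens are neutral (0.5)."""
    s = spam_counts[token] / n_spam
    h = ham_counts[token] / n_ham
    if s + h == 0:
        return 0.5
    return s / (s + h)

def score(message, spam_msgs, ham_msgs):
    """Average the token probabilities (a simplification, not chi-square combining)."""
    spam_counts, ham_counts = train(spam_msgs), train(ham_msgs)
    tokens = set(message.split())
    probs = [
        token_spam_prob(t, spam_counts, ham_counts, len(spam_msgs), len(ham_msgs))
        for t in tokens
    ]
    return sum(probs) / len(probs)

# Hypothetical data: a spam payload, poison text appended to it, and some ham.
poison = "twain poem river steamboat"
spam_core = "cheap pills xz9q7 order now"
full_spam = spam_core + " " + poison
ham_corpus = ["meeting agenda attached", "lunch tomorrow question"]
footer_ham = "quick question " + poison  # legit mail with the same footer text

# Train on the cleaned spam, then test the unaltered spam.
full_spam_after_cleanup = score(full_spam, [spam_core], ham_corpus)

# The legit mail with the footer, with and without the poison trained as spam.
footer_ham_with_poison = score(footer_ham, [full_spam], ham_corpus)
footer_ham_after_cleanup = score(footer_ham, [spam_core], ham_corpus)

print("full spam, trained on cleaned copy:", round(full_spam_after_cleanup, 3))
print("footer ham, poison trained:        ", round(footer_ham_with_poison, 3))
print("footer ham, poison removed:        ", round(footer_ham_after_cleanup, 3))
```

In this toy setup the first score stays above 0.5 even though the poison was never trained, which is RW's fingerprinting point, while the footer-only ham drops below 0.5 once the poison is removed from training, which is the effect described for the ham corpus. A real evaluation would need test mail that was not derived from the training mail.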