Date: Tue, 15 Jan 2013 10:55:17 -0800 (PST)
From: John Hardin
To: users@spamassassin.apache.org
Subject: Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
In-Reply-To: <50F58343.2000101@indietorrent.org>

On Tue, 15 Jan 2013, Ben Johnson wrote:

> On 1/14/2013 8:16 PM, John Hardin wrote:
>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>
>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>> are they all performed by SA?
>
> In postfix's main.cf:
>
> smtpd_recipient_restrictions = permit_mynetworks,
> permit_sasl_authenticated, check_recipient_access
> mysql:/etc/postfix/mysql-virtual_recipient.cf,
> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>
> Do you recommend something more?

Unfortunately I have no experience administering Postfix. Perhaps one of
the other listies can help.

>> http://www.greylisting.org/
>
> Hmm, very interesting. No, I have no greylisting in place as yet, and
> no, my userbase doesn't demand immediate delivery. I will look into
> greylisting further.

One other thing you might try is publishing an SPF record for your
domain. There is anecdotal evidence that this reduces the raw spam
volume to that domain a bit.
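
For illustration, a minimal sketch of what publishing and then checking
an SPF record might look like. The domain example.com and the address
192.0.2.10 are placeholders rather than anything from this thread,
spf_record is a hypothetical helper name, and the sketch assumes dig is
installed:

```shell
# Hypothetical zone-file entry (example.com and 192.0.2.10 are placeholders):
#   example.com.  IN  TXT  "v=spf1 mx ip4:192.0.2.10 -all"
# "mx" authorizes the domain's MX hosts, "ip4:" adds one more sending
# address, and "-all" asks receivers to fail every other source.

# Print the SPF record (if any) that a domain currently publishes.
spf_record() {
    dig +short TXT "$1" | grep -i 'v=spf1'
}

# Usage: spf_record example.com
```

An empty result would suggest that no SPF record is visible in DNS yet.
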

> Given this information, it concerns me that Bayes scores hardly seem to
> budge when I feed sa-learn nearly identical messages 3+ times. We'll get
> into that below.
>
>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>> to perform so miserably.
>>
>> Agreed.
>>
>>> It must be a configuration issue, because I've sa-learn-ed messages
>>> that are incredibly similar for two days now and not only do their
>>> Bayes scores not change significantly, but sometimes they decrease.
>>> And I have a hard time believing that one of my users is sa-train-ing
>>> these messages as ham and negating my efforts.
>>
>> This is why you retain your Bayes training corpora: so that if Bayes
>> goes off the rails you can review your corpora for misclassifications,
>> wipe and retrain. Do you have your training corpora? Or do you discard
>> messages once you've trained them?
>
> I had the good sense to retain the corpora.

Yay!

>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>> do you review their submissions? And if the process is automated, do you
>> retain what they have provided for training so that you can go back
>> later and do a troubleshooting review?
>
> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
> They do so unsupervised. Why this could be a problem is obvious. And no,
> I don't retain their submissions. I probably should. I wonder if I can
> make a few slight modifications to the shell script that Antispam calls,
> such that it simply sends a copy of the message to an administrator
> rather than calling sa-learn on the message.

That would be a very good idea if the number of users doing training is
small. At the very least, the messages should be captured to a permanent
corpus mailbox. Do your users also train ham? Are the procedures similar
enough that your users could become easily confused?

>> Do you have autolearn turned on?
>> My opinion is that autolearn is only
>> appropriate for a large and very diverse userbase where a sufficiently
>> "common" corpus of ham can't be manually collected. But then, I don't
>> admin a Really Large Install, so YMMV.
>
> No, I was sure to disable autolearn after the last Bayes fiasco. :)

OK.

>> Do you use per-user or sitewide Bayes? If per-user, then you need to
>> make sure that you're training Bayes as the same user that the MTA is
>> running SA as.
>
> Site-wide. And I have hard-coded the username in the SA configuration to
> prevent confusion in this regard:
>
> bayes_sql_override_username amavis
>
>> What user does your MTA run SA as? What user do you train Bayes as?
>
> The MTA should pass scanning off to "amavis". I train the DB in two
> ways: via Dovecot Antispam and by calling sa-learn on my training
> mailbox. Given that I have hard-coded the username, the output of
> "sa-learn --dump magic" is the same whether I issue the command under my
> own account or "su" to the "amavis" user.

OK, good.

>>> I have ensured that the spam token count increases when I train these
>>> messages. That said, I do notice that the token count does not *always*
>>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>>> message(s) examined)". Does this mean that all tokens from these
>>> messages have already been learned, thereby making it pointless to
>>> continue feeding them to sa-learn?
>>
>> No, it means that Message-ID has been learned from before.
>
> I see. So, when this happens, it means that one of my users has already
> dragged the message from Inbox to Junk (which triggers the Antispam
> plug-in and feeds the message to sa-learn).

Very likely.

The extremely odd thing is that you say you sometimes train a message as
spam, and its Bayes score goes *down*. Are you training a message and
then running it through spamc to see if the score changed, or is this
about _similar_ messages rather than _that_ message?
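
One way to settle that question is to score the very same file before
and after training it. A rough sketch, assuming the standalone
spamassassin script reads the same Bayes database as the scanner; it
uses "spamassassin -t" in place of spamc so that no spamd is needed, and
bayes_rule and retrain_check are hypothetical helper names:

```shell
# Print the first BAYES_nn rule that SpamAssassin reports for a message file.
bayes_rule() {
    spamassassin -t < "$1" | grep -o 'BAYES_[0-9]*' | head -n 1
}

# Score one message, train it as spam, then score the same message again.
retrain_check() {
    msg=$1
    before=$(bayes_rule "$msg")
    sa-learn --spam "$msg"      # prints "Learned tokens from N message(s) ..."
    after=$(bayes_rule "$msg")
    echo "before=$before after=$after"
}

# Usage: retrain_check /path/to/one-message.eml   (hypothetical path)
```

If sa-learn reports "Learned tokens from 0 message(s)", the two values
should match, since nothing new was learned from that message.
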

> When this scenario occurs, my efforts in feeding the same message to
> sa-learn are wasted, right? Bayes doesn't "learn more" from the message
> the second time, or increase its tokens' "weight", right? It would be
> nice if I could eliminate this duplicate effort.

Correct, no new information is learned.

> Based on my responses, what's the next move? Back up the Bayes DB, wipe
> it, and feed my corpus through the ol' chipper?

That, and configure the user-based training to at the very least capture
what they submit to a corpus so you can review it. Whether you do that
review pre-training or post-bayes-is-insane is up to you.

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  The difference is that Unix has had thirty years of technical
  types demanding basic functionality of it. And the Macintosh has
  had fifteen years of interface fascist users shaping its progress.
  Windows has the hairpin turns of the Microsoft marketing machine
  and that's all.                                    -- Red Drag Diva
-----------------------------------------------------------------------
 2 days until Benjamin Franklin's 307th Birthday