Mailing-List: contact users-help@spamassassin.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (nike.apache.org: domain of jhardin@impsec.org designates
 207.210.83.140 as permitted sender)
Date: Tue, 15 Jan 2013 10:55:17 -0800 (PST)
From: John Hardin <jhardin@impsec.org>
To: users@spamassassin.apache.org
Subject: Re: Calling spamassassin directly yields very different results than
 calling spamassassin via amavis-new
In-Reply-To: <50F58343.2000101@indietorrent.org>
Message-ID: <alpine.LNX.2.00.1301151038460.8211@athena.impsec.org>
References: <50EDEBAD.2030104@indietorrent.org>
 <20130109223630.47bf7d51@gumby.homeunix.com>
 <50EE053B.7000107@indietorrent.org> <201301100136.27445.mewolf1@gmx.net>
 <50EE2030.6020604@indietorrent.org>
 <alpine.LNX.2.00.1301091812350.13134@athena.impsec.org>
 <50EEEFC0.9050706@indietorrent.org>
 <20130110164944.26e84494@gumby.homeunix.com>
 <50EEF7D3.7080003@indietorrent.org> <50EEFED7.9090800@indietorrent.org>
 <20130110180604.5bce777e@gumby.homeunix.com>
 <50EF0EBE.9000401@indietorrent.org> <50EF2101.7000005@whyscream.net>
 <50F083DC.3040808@indietorrent.org> <50F44D77.6080209@indietorrent.org>
 <20130114194906.23094021@gumby.homeunix.com>
 <50F471C5.7090800@indietorrent.org>
 <alpine.LNX.2.00.1301141647120.18624@athena.impsec.org>
 <50F58343.2000101@indietorrent.org>
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed

On Tue, 15 Jan 2013, Ben Johnson wrote:

> On 1/14/2013 8:16 PM, John Hardin wrote:
>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>
>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>> are they all performed by SA?
>
> In postfix's main.cf:
>
> smtpd_recipient_restrictions = permit_mynetworks,
> permit_sasl_authenticated, check_recipient_access
> mysql:/etc/postfix/mysql-virtual_recipient.cf,
> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>
> Do you recommend something more?

Unfortunately I have no experience administering Postfix. Perhaps one of 
the other listies can help.

>>   http://www.greylisting.org/
>
> Hmm, very interesting. No, I have no greylisting in place as yet, and
> no, my userbase doesn't demand immediate delivery. I will look into
> greylisting further.

One other thing you might try is publishing an SPF record for your domain. 
There is anecdotal evidence that this reduces the raw spam volume to that 
domain a bit.

> Given this information, it concerns me that Bayes scores hardly seem to 
> budge when I feed sa-learn nearly identical messages 3+ times. We'll get 
> into that below.
>
>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>> to perform so miserably.
>>
>> Agreed.
>>
>>> It must be a configuration issue, because I've sa-learn-ed messages
>>> that are incredibly similar for two days now and not only do their
>>> Bayes scores not change significantly, but sometimes they decrease.
>>> And I have a hard time believing that one of my users is sa-train-ing
>>> these messages as ham and negating my efforts.
>>
>> This is why you retain your Bayes training corpora: so that if Bayes
>> goes off the rails you can review your corpora for misclassifications,
>> wipe and retrain. Do you have your training corpora? Or do you discard
>> messages once you've trained them?
>
> I had the good sense to retain the corpora.

Yay!

>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>> do you review their submissions? And if the process is automated, do you
>> retain what they have provided for training so that you can go back
>> later and do a troubleshooting review?
>
> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
> They do so unsupervised. Why this could be a problem is obvious. And no,
> I don't retain their submissions. I probably should. I wonder if I can
> make a few slight modifications to the shell script that Antispam calls,
> such that it simply sends a copy of the message to an administrator
> rather than calling sa-learn on the message.

That would be a very good idea if the number of users doing training is 
small. At the very least, the messages should be captured to a permanent 
corpus mailbox.
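
A minimal sketch of that capture step, assuming the Antispam plug-in's 
script can hand the raw message to a small Python helper. The function 
name and the mbox path are hypothetical, not anything Dovecot or SA 
provides:

```python
import mailbox


def capture_to_corpus(raw_message: bytes, corpus_path: str) -> int:
    """Append one raw message to an mbox corpus file.

    Returns the number of messages in the corpus afterwards.
    corpus_path is a placeholder; point it at wherever you keep corpora.
    """
    box = mailbox.mbox(corpus_path, create=True)
    box.lock()  # several users may submit at the same moment
    try:
        box.add(raw_message)  # mailbox.mbox accepts raw bytes
        box.flush()
        return len(box)
    finally:
        box.unlock()
        box.close()
```

The script would pass along whatever the plug-in gives it, e.g. the 
bytes read from stdin, and use one corpus file for spam submissions and 
another for ham. The resulting mbox can be reviewed by hand and later 
fed to sa-learn with --mbox.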

Do your users also train ham? Are the procedures similar enough that your 
users could become easily confused?

>> Do you have autolearn turned on? My opinion is that autolearn is only
>> appropriate for a large and very diverse userbase where a sufficiently
>> "common" corpus of ham can't be manually collected. But then, I don't
>> admin a Really Large Install, so YMMV.
>
> No, I was sure to disable autolearn after the last Bayes fiasco. :)

OK.

>> Do you use per-user or sitewide Bayes? If per-user, then you need to
>> make sure that you're training Bayes as the same user that the MTA is
>> running SA as.
>
> Site-wide. And I have hard-coded the username in the SA configuration to
> prevent confusion in this regard:
>
> bayes_sql_override_username amavis
>
>> What user does your MTA run SA as? What user do you train Bayes as?
>
> The MTA should pass scanning off to "amavis". I train the DB in two
> ways: via Dovecot Antispam and by calling sa-learn on my training
> mailbox. Given that I have hard-coded the username, the output of
> "sa-learn --dump magic" is the same whether I issue the command under my
> own account or "su" to the "amavis" user.

OK, good.

>>> I have ensured that the spam token count increases when I train these
>>> messages. That said, I do notice that the token count does not *always*
>>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>>> message(s) examined)". Does this mean that all tokens from these
>>> messages have already been learned, thereby making it pointless to
>>> continue feeding them to sa-learn?
>>
>> No, it means that Message-ID has been learned from before.
>
> I see. So, when this happens, it means that one of my users has already
> dragged the message from Inbox to Junk (which triggers the Antispam
> plug-in and feeds the message to sa-learn).

Very likely.

The extremely odd thing is that you say you sometimes train a message as 
spam, and its Bayes score goes *down*. Are you training a message and 
then running it through spamc to see if the score changed, or is this 
about _similar_ messages rather than _that_ message?
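
If it is the former, one way to make the comparison concrete is to scan 
the message, note which BAYES_nn rule fired, train, and scan again. A 
rough sketch, assuming the scanned copy carries an X-Spam-Status header 
with SA's default tests= list (the helper name is made up for 
illustration):

```python
import re
from email import message_from_string

# Matches rule names such as BAYES_00, BAYES_50, BAYES_99.
BAYES_RULE = re.compile(r"\bBAYES_\d{2,3}\b")


def bayes_rule_hit(scanned_message: str):
    """Return the first BAYES_nn rule named in X-Spam-Status, or None."""
    msg = message_from_string(scanned_message)
    status = msg.get("X-Spam-Status", "")  # folded headers are fine here
    match = BAYES_RULE.search(status)
    return match.group(0) if match else None
```

Run it on the spamc output for the same message before and after 
training. If no BAYES rule shows up at all, Bayes may not be in use for 
that scan, which is a different problem from a score that moves the 
wrong way.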

> When this scenario occurs, my efforts in feeding the same message to
> sa-learn are wasted, right? Bayes doesn't "learn more" from the message
> the second time, or increase its tokens' "weight", right? It would be
> nice if I could eliminate this duplicate effort.

Correct, no new information is learned.

> Based on my responses, what's the next move? Back up the Bayes DB, wipe
> it, and feed my corpus through the ol' chipper?

That, and configure the user-based training to at the very least capture 
what they submit to a corpus so you can review it. Whether you do that 
review pre-training or post-bayes-is-insane is up to you.
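
The backup, wipe and retrain sequence could look roughly like this. It 
is only a sketch: the mbox and backup paths are placeholders, the 
function names are invented, and it assumes your reviewed corpora are 
in mbox format and that sa-learn is on the PATH of the user running it:

```python
import subprocess


def retrain_plan(spam_mbox: str, ham_mbox: str):
    """Return the sa-learn invocations, in order, as argv lists."""
    return [
        ["sa-learn", "--backup"],                     # dumps the DB to stdout
        ["sa-learn", "--clear"],                      # wipes the Bayes DB
        ["sa-learn", "--spam", "--mbox", spam_mbox],  # retrain reviewed spam
        ["sa-learn", "--ham", "--mbox", ham_mbox],    # retrain reviewed ham
        ["sa-learn", "--dump", "magic"],              # check the new counts
    ]


def run_plan(plan, backup_path: str) -> None:
    """Run each step in order, stopping at the first failure."""
    for argv in plan:
        if "--backup" in argv:
            # Save the backup output so it can be restored later.
            with open(backup_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
        else:
            subprocess.run(argv, check=True)
```

Print the plan and read it over before calling run_plan(), since the 
--clear step is destructive. Only review the corpora first, so that a 
misclassified message doesn't go straight back in.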

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   The difference is that Unix has had thirty years of technical
   types demanding basic functionality of it. And the Macintosh has
   had fifteen years of interface fascist users shaping its progress.
   Windows has the hairpin turns of the Microsoft marketing machine
   and that's all.                                    -- Red Drag Diva
-----------------------------------------------------------------------
  2 days until Benjamin Franklin's 307th Birthday