Mailing-List: contact users-help@spamassassin.apache.org; run by ezmlm
Precedence: bulk
Received-SPF: pass (nike.apache.org: domain of
 sf-spamassassin-talk@m.gmane.org designates 80.91.229.12 as permitted sender)
To: users@spamassassin.apache.org
From: NFN Smith <worldoff9908@mail.com>
Subject: Bayes implementation questions
Date: Thu, 03 Jun 2010 12:09:18 -0700
Lines: 80
Message-ID: <hu8ul2$j88$1@dough.gmane.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Complaints-To: usenet@dough.gmane.org
X-Gmane-NNTP-Posting-Host: wsip-98-190-158-226.ph.ph.cox.net
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US;
 rv:1.9.1.9) Gecko/20100317 SeaMonkey/2.0.4
X-Virus-Checked: Checked by ClamAV on apache.org

After using SpamAssassin for a number of years, I'm finally getting
around to implementing Bayesian filtering.  Most of my users are
non-technical and connect via POP (although some use IMAP clients,
both offline and webmail), so I want to apply Bayesian learning on a
server-wide basis rather than a per-user basis.

I do recognize the challenge of keeping both spamtraps and hamtraps
adequately fed, and I have a couple of ideas for how to handle that.

I have most of the basics figured out and working in a test setup,
but I'm not yet confident enough in the details to apply this to a
production server.

My server runs Debian Lenny, with sendmail 8.14.3, MIMEDefang 2.64,
SpamAssassin 3.3.1 (taken from the Debian lenny-backports branch), and
cyrus-imapd 2.2.13.

I use MIMEDefang to call SpamAssassin, rather than spamc/spamd, and my
SA configuration lives in /etc/mail/sa-mimedefang.cf.

I've also found sa-learn-cyrus, and that appears to be working well, but 
I'm not sure if it's necessarily doing everything I need -- thus, if 
there's a different method of scanning Cyrus-format mailboxes, I'm quite 
willing to try that.

Areas of question:

1) I'm assuming that I want to run sa-learn-cyrus as the same user ID as 
is used to run SpamAssassin (mail:mail).

2) I'm struggling a bit with location of the Bayesian database.  In 
sa-mimedefang.cf, I have specified:

    bayes_path /var/spamassassin/bayes/bayes
    bayes_file_mode 0777

but it looks like sa-learn-cyrus is ignoring that, even though I have

    prefs_file = /etc/mail/sa-mimedefang.cf

included in /etc/spamassassin/sa-learn-cyrus.conf.  As a result, when I 
run sa-learn-cyrus, the Bayesian data is being located in 
~mail/.spamassassin, which on my system is /var/mail/.spamassassin.  The 
data is all there correctly, it's just not where I would choose to put 
it, but maybe that's not a problem.
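
For reference, this is roughly how I have been checking which location
actually holds the database.  It is only a sketch: it assumes the
default DBM storage, where SpamAssassin appends _toks and _seen to the
bayes_path prefix, and the function name find_bayes_files is my own.

```python
import os


def find_bayes_files(prefixes):
    """Map each candidate bayes_path prefix to the Bayes files found there.

    Assumes DBM storage, where the files are named <prefix>_toks and
    <prefix>_seen. Other storage backends (e.g. SQL) are not covered.
    """
    found = {}
    for prefix in prefixes:
        candidates = [prefix + "_toks", prefix + "_seen"]
        found[prefix] = [path for path in candidates if os.path.exists(path)]
    return found


# Example call with the two locations mentioned in this message:
#   find_bayes_files(["/var/spamassassin/bayes/bayes",
#                     "/var/mail/.spamassassin/bayes"])
```

An empty list for the first prefix and a populated one for the second
would match what I am seeing.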

3) For ongoing usage, I will offer my IMAP users the option of
submitting spam and ham samples via learn.ham and learn.spam folders
(as per the sa-learn-cyrus documentation).  For POP users, however, I
haven't yet figured out a way to let them make occasional sample
submissions.  My primary concern is being able to respond when a
legitimate message gets scored with BAYES_99 and getting that
corrected, but I also want to allow submissions of spam that is
reaching live users without hitting my spamtraps.

4) I run several servers in parallel.  My spamtraps indicate that some 
spam operations hit user ids on two or more of my servers, while other 
ops seem to have only user addresses on a single server.  Is there a way 
of feeding the Bayesian data on one server to the other servers?

Most of the spam data I work from is what's hitting my spamtraps,
which I currently distribute with a little judicious use of rules in
an IMAP client (e.g., copying content to learn.ham folders on accounts
on each server), but I'm wondering if there's an easier way.  From my
reading of previous discussions, I know that sharing database files is
best avoided, so I'm thinking that a better approach is simply to copy
the message traffic from one server to another, and then let
sa-learn-cyrus learn the content on each server.  At that point, the
question is how to copy content from a Cyrus mailbox on one server to
a mailbox on another server via scripting, rather than having to play
with an IMAP client.  But maybe that's a Cyrus-specific question.
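
To make the idea concrete, the kind of scripted copy I have in mind
would look something like the Python sketch below, using the standard
imaplib module.  This is untested against Cyrus; the function name
copy_mailbox and the folder names passed to it are my own placeholders,
and it copies every message each time it runs, with no de-duplication.

```python
import imaplib


def copy_mailbox(src, dst, src_folder, dst_folder):
    """Copy all messages from src_folder on src to dst_folder on dst.

    src and dst are logged-in imaplib connections (for example
    imaplib.IMAP4_SSL objects). Returns the number of messages copied.
    """
    # Open the source folder read-only so message flags are not changed.
    status, _ = src.select(src_folder, readonly=True)
    if status != "OK":
        raise RuntimeError("cannot select source folder %r" % src_folder)

    status, data = src.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed on %r" % src_folder)

    copied = 0
    for num in data[0].split():
        status, msg_data = src.fetch(num, "(RFC822)")
        if status != "OK":
            continue  # skip messages that cannot be fetched
        raw_message = msg_data[0][1]  # full message as bytes
        # No flags and no explicit date: the destination server decides.
        status, _ = dst.append(dst_folder, None, None, raw_message)
        if status == "OK":
            copied += 1
    return copied


# Hypothetical usage (host names and credentials are placeholders):
#   src = imaplib.IMAP4_SSL("server1.example.com"); src.login(user, pw)
#   dst = imaplib.IMAP4_SSL("server2.example.com"); dst.login(user, pw)
#   copy_mailbox(src, dst, "spamtrap", "learn.spam")
```

Run from cron on each server, something like this would let
sa-learn-cyrus pick up the copied messages on its next pass.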

Thanks in advance for advice.

Smith