From: NFN Smith
To: users@spamassassin.apache.org
Subject: Bayes implementation questions
Date: Thu, 03 Jun 2010 12:09:18 -0700

After using SpamAssassin for a number of years, I'm finally getting around to implementing Bayesian filters.

For my particular setup, the bulk of my users are non-technical and make POP connections (although some use IMAP clients, both offline and webmail). Thus, I want to apply Bayesian learning on a server-wide basis rather than a per-user basis. I do recognize the challenges of keeping both spamtraps and hamtraps adequately fed, and I've got a couple of ideas for how to facilitate that.

I have most of the basics figured out and working in a test situation, but I'm not yet confident enough in the details to apply this to a production server.

My current setup: the server runs Debian Lenny with sendmail 8.14.3, MIMEDefang 2.64, SpamAssassin 3.3.1 (taken from the Debian lenny-backports branch), and cyrus-imapd 2.2.13. I use MIMEDefang to call SpamAssassin rather than using spamc/spamd, and my SA configuration lives in /etc/mail/sa-mimedefang.cf.

I've also found sa-learn-cyrus, and that appears to be working well, but I'm not sure it's doing everything I need. If there's a different method of scanning Cyrus-format mailboxes, I'm quite willing to try it.

Areas of question:

1) I'm assuming that I want to run sa-learn-cyrus as the same user ID that is used to run SpamAssassin (mail:mail).

2) I'm struggling a bit with the location of the Bayesian database.
In sa-mimedefang.cf, I have specified:

    bayes_path /var/spamassassin/bayes/bayes
    bayes_file_mode 0777

but it looks like sa-learn-cyrus is ignoring that, even though I have

    prefs_file = /etc/mail/sa-mimedefang.cf

in /etc/spamassassin/sa-learn-cyrus.conf. As a result, when I run sa-learn-cyrus, the Bayesian data ends up in ~mail/.spamassassin, which on my system is /var/mail/.spamassassin. The data is all there correctly; it's just not where I would choose to put it, but maybe that's not a problem.

3) For ongoing usage, I will offer my users who have IMAP accounts the option of submitting spam and ham samples via learn.ham and learn.spam folders (as per the sa-learn-cyrus documentation). However, I haven't yet figured out a way to let POP users make occasional sample submissions. My primary concern is being able to respond when a legitimate message gets scored with BAYES_99 and getting that cleared, but by the same token, I do want to allow for submissions of spam that is reaching live users but not hitting my spamtraps.

4) I run several servers in parallel. My spamtraps indicate that some spam operations hit user IDs on two or more of my servers, while other operations seem to have user addresses on only a single server. Is there a way of feeding the Bayesian data on one server to the other servers? Most of the spam data I work from is what's hitting my spamtraps, and I can get by with a little judicious use of rules via an IMAP client (e.g., copying content to learn.ham folders on accounts on each server), but I'm wondering if there's an easier way. From my reading of previous discussions, I know that sharing database files is best avoided, so I'm thinking that a better approach is simply to get the message traffic copied from one server to another, and then let sa-learn-cyrus learn the content on each server.
At that point, the question is how to get content copied from a Cyrus mailbox on one server to a mailbox on another server via scripting, rather than having to play with an IMAP client. But maybe that's a Cyrus-specific question.

Thanks in advance for advice.

Smith
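
To make the scripted copy in item 4 concrete, here is a minimal sketch of one possible approach, not a tested recipe for this setup. It uses Python's standard imaplib to read every message from a folder on one server and append it to a folder on another, so that sa-learn-cyrus on the second server can learn from it. The function name copy_folder is hypothetical. The sketch assumes both connections are already logged in, and the exact folder path (learn.spam here) depends on how the Cyrus namespace is configured.

```python
import imaplib  # standard library; supplies IMAP4 / IMAP4_SSL connection classes


def copy_folder(src, dst, src_folder, dst_folder):
    """Copy every message in src_folder on src into dst_folder on dst.

    src and dst are assumed to be already-authenticated imaplib.IMAP4
    (or IMAP4_SSL) connections. Returns the number of messages copied.
    """
    # Open the source folder read-only so nothing is flagged or changed there.
    typ, _ = src.select(src_folder, readonly=True)
    if typ != "OK":
        raise RuntimeError("cannot select source folder %r" % src_folder)

    typ, data = src.search(None, "ALL")
    if typ != "OK":
        raise RuntimeError("search failed on %r" % src_folder)

    copied = 0
    for num in data[0].split():
        typ, parts = src.fetch(num, "(RFC822)")
        # A successful fetch returns a (header, raw_message) tuple first.
        if typ != "OK" or not parts or not isinstance(parts[0], tuple):
            continue
        raw_message = parts[0][1]
        # No flags and no explicit date: the receiving server applies defaults.
        typ, _ = dst.append(dst_folder, None, None, raw_message)
        if typ == "OK":
            copied += 1
    return copied


# Example use (hostnames and credentials below are placeholders):
#   src = imaplib.IMAP4_SSL("mail1.example.org"); src.login("trapuser", "secret")
#   dst = imaplib.IMAP4_SSL("mail2.example.org"); dst.login("trapuser", "secret")
#   copy_folder(src, dst, "learn.spam", "learn.spam")
```

Run from cron on one host, something along these lines would avoid sharing the Bayes database files themselves. It makes no attempt to skip messages that were already copied on an earlier run, so the source folder would need to be emptied or tracked separately.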