Message-ID: <4C600B55.2030603@gmail.com>
Date: Mon, 09 Aug 2010 09:06:13 -0500
From: "Matthew Kitchin (public/usenet)"
To: users@spamassassin.apache.org
Subject: Re: List of "banned" words/bounce to sender
References: <1281355094.2104.30.camel@zappa.gregorie.org> <20100809132758.GA14804@smtp.hege.li>
In-Reply-To: <20100809132758.GA14804@smtp.hege.li>

On 8/9/2010 8:27 AM, Henrik K wrote:
> Nope, people constantly underestimate the power of regexes.. of course you
> can easily make bad ones, but Perl can run huge lists of simple alternations
> FAST.
>
> I downloaded a 10000 random name pack, and made a quick hack to regexify it
> with my favourite Regexp::Assemble.
>
> ------------------------------
> #!/usr/bin/perl
> use Regexp::Assemble;
> $ra = Regexp::Assemble->new;
> while (<STDIN>) {
>     chomp;
>     # Read comma separated names from stdin: Firstname,Lastname
>     ($firstname, $lastname) = split(',', lc);
>     # Firstname Lastname
>     $ra->add("$firstname $lastname");
>     # Lastname,? Firstname
>     $ra->add("$lastname,? $firstname");
>     # Print rule every 10000 names
>     # (?:^| ) instead of \b since "Kate" would hit "Mary-Kate"
>     if (++$cnt % 10000 == 0 || eof STDIN) {
>         print 'body TEST_NAMES_'.++$idx;
>         print ' /(?:^| )'.$ra->as_string.'(?:$| )/i'."\n";
>     }
> }
> ------------------------------
> ./names.pl < names.csv > names.cf
>
> The resulting single 170000 byte rule did not affect SA in any way, there was
> virtually no difference in my mass check tests. Running the regex through
> some file manually results in 80000 lines/second. This with one 3 GHz core.
> I think you can make rules/REs of MBs in size, but gains probably nothing.
>
> About ClamAV...
>
> + It would probably handle this even faster
> + Easy logging of exact signature that got hit (single name per sig)
> - It would also match any header like To: From: etc (PRETTY BAD...)
>
> I'd choose SA since it's way more flexible. I doubt performance here is a
> factor, especially with outgoing mail..
>

Thanks for the info.

- It would also match any header like To: From: etc (PRETTY BAD...)

That could be an issue. I will check to see if I can find a workaround; if
not, ClamAV may not be an option.
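
For anyone following along, here is a minimal sketch of the anchoring point from the comment in the quoted script: why (?:^| ) is used instead of \b. It uses only core Perl and a plain joined alternation, not Regexp::Assemble, so it does not reproduce the trie optimisation. The two sample names, the test lines, and the hits() helper are made up for illustration and are not from the original name pack.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample names; the quoted script reads its list from names.csv.
my @names = ('kate smith', 'john doe');

# Plain alternation with metacharacters escaped (no Regexp::Assemble here).
my $alt = join '|', map { quotemeta($_) } @names;

my $word_re  = qr/\b(?:$alt)\b/i;            # \b word-boundary anchors
my $space_re = qr/(?:^| )(?:$alt)(?:$| )/i;  # anchors used in the quoted script

# Illustrative helper: 1 if the pattern matches the line, otherwise 0.
sub hits {
    my ($re, $line) = @_;
    return $line =~ $re ? 1 : 0;
}

for my $line ('Regards, Kate Smith', 'Mary-Kate Smith called', 'john doe') {
    printf "%-26s \\b=%d space=%d\n",
        $line, hits($word_re, $line), hits($space_re, $line);
}
```

Only the "Mary-Kate Smith called" line should differ between the two patterns: \b sees a word boundary after the hyphen and matches "Kate Smith", while the space-anchored form does not. The trade-off is that the space-anchored form would also miss a name followed directly by punctuation, such as "Kate Smith,".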