spamassassin-users mailing list archives

From Henrik K <>
Subject Re: List of "banned" words/bounce to sender
Date Mon, 09 Aug 2010 13:27:58 GMT
On Mon, Aug 09, 2010 at 07:28:42AM -0500, Daniel McDonald wrote:
> This technique might cut down the number of rules by 93.5%, but then you
> have to do database lookups and some fancy parsing to verify the hit. 
> Don't know if that would be worth it.

Nope, people constantly underestimate the power of regexes. Of course you
can easily make bad ones, but Perl can run huge lists of simple alternations
quickly.

I downloaded a 10000 random name pack, and made a quick hack to regexify it
with my favourite Regexp::Assemble.

use Regexp::Assemble;
$ra = Regexp::Assemble->new;
while (<STDIN>) {
    # Strip the trailing newline so it does not end up in the pattern
    chomp;
    # Read comma separated names from stdin: Firstname,Lastname
    ($firstname, $lastname) = split(',', lc);
    # Firstname Lastname
    $ra->add("$firstname $lastname");
    # Lastname,? Firstname
    $ra->add("$lastname,? $firstname");
    # Print rule every 10000 names
    # (?:^| ) instead of \b since "Kate" would hit "Mary-Kate"
    if (++$cnt % 10000 == 0 || eof STDIN) {
        print 'body TEST_NAMES_'.++$idx;
        print ' /(?:^| )'.$ra->as_string.'(?:$| )/i'."\n";
    }
}

./ < names.csv >
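
To make the (?:^| ) versus \b comment concrete, here is a minimal sketch in
core Perl. The two-name alternation is a hand-written stand-in for what
Regexp::Assemble would produce, and the names and the hits() helper are made
up for the example:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the assembled pattern; the real one comes from
# Regexp::Assemble->as_string and is far larger.
my $names = '(?:kate smith|john doe)';

# Same anchoring as the generated rule: start/end of line or a literal space.
my $strict = qr/(?:^| )${names}(?:$| )/i;
# Plain word boundaries, for comparison.
my $loose  = qr/\b${names}\b/i;

# Report which of the two patterns match a line.
sub hits {
    my ($line) = @_;
    my @matched;
    push @matched, 'strict' if $line =~ $strict;
    push @matched, 'loose'  if $line =~ $loose;
    return join ',', @matched;
}

print hits('Regards, Kate Smith'), "\n";       # both patterns match
print hits('Regards, Mary-Kate Smith'), "\n";  # only the \b version matches
```

The hyphen counts as a word boundary, so \b lets "Kate Smith" match inside
"Mary-Kate Smith", while the space-or-line-edge version does not.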

The resulting single 170000-byte rule did not affect SA in any way; there was
virtually no difference in my mass-check tests. Running the regex manually
over a file manages 80000 lines/second, and that is on one 3 GHz core.
I think you could make rules/REs megabytes in size, but it would probably
gain nothing.
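
For anyone who wants to repeat the lines/second measurement, something along
these lines should do. It is only a sketch: time_regex() is a hypothetical
helper, and the pattern and input lines are small stand-ins for the generated
rule and real mail text, so the number it prints says nothing about the
170000-byte rule above.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: run a compiled pattern over a list of lines and
# return (lines processed, matches found, elapsed seconds).
sub time_regex {
    my ($re, $lines) = @_;
    my $hits  = 0;
    my $start = Time::HiRes::time();
    for my $line (@$lines) {
        $hits++ if $line =~ $re;
    }
    return (scalar @$lines, $hits, Time::HiRes::time() - $start);
}

# Stand-in pattern and input; substitute the generated rule body and a real file.
my $re    = qr/(?:^| )(?:kate smith|john doe)(?:$| )/i;
my @lines = ('hello from kate smith', 'nothing to see here') x 50_000;

my ($n, $hits, $secs) = time_regex($re, \@lines);
# Guard against a zero elapsed time on very fast runs.
printf "%d lines, %d hits, %.0f lines/second\n", $n, $hits, $n / ($secs || 1e-9);
```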

About ClamAV...

+ It would probably handle this even faster
+ Easy logging of exact signature that got hit (single name per sig)
- It would also match any header like To: From: etc (PRETTY BAD...)

I'd choose SA since it's way more flexible. I doubt performance here is a
factor, especially with outgoing mail..
