spamassassin-users mailing list archives

From "Matthew Kitchin (public/usenet)" <mkitchin.pub...@gmail.com>
Subject Re: List of "banned" words/bounce to sender
Date Mon, 09 Aug 2010 14:06:13 GMT
  On 8/9/2010 8:27 AM, Henrik K wrote:
> Nope, people constantly underestimate the power of regexes.. of course you
> can easily make bad ones, but Perl can run huge lists of simple alternations
> FAST.
>
> I downloaded a 10000 random name pack, and made a quick hack to regexify it
> with my favourite Regexp::Assemble.
>
> ------------------------------
> #!/usr/bin/perl
> use Regexp::Assemble;
> $ra = Regexp::Assemble->new;
> while (<STDIN>) {
>      chomp;
>      # Read comma separated names from stdin: Firstname,Lastname
>      ($firstname, $lastname) = split(',', lc);
>      # Firstname Lastname
>      $ra->add("$firstname $lastname");
>      # Lastname,? Firstname
>      $ra->add("$lastname,? $firstname");
>      # Print rule every 10000 names
>      # (?:^| ) instead of \b since "Kate" would hit "Mary-Kate"
>      if (++$cnt % 10000 == 0 || eof STDIN) {
>          print 'body TEST_NAMES_'.++$idx;
>          print ' /(?:^| )'.$ra->as_string.'(?:$| )/i'."\n";
>      }
> }
> ------------------------------
> ./names.pl < names.csv > names.cf
>
> The resulting single 170000 byte rule did not affect SA in any way; there was
> virtually no difference in my mass check tests. Running the regex through
> some file manually results in 80000 lines/second. This is with one 3 GHz core.
> I think you can make rules/REs of MBs in size, but that probably gains nothing.
>
> About ClamAV...
>
> + It would probably handle this even faster
> + Easy logging of exact signature that got hit (single name per sig)
> - It would also match any header like To: From: etc (PRETTY BAD...)
>
> I'd choose SA since it's way more flexible. I doubt performance here is a
> factor, especially with outgoing mail..
>
Thanks for the info.

> - It would also match any header like To: From: etc (PRETTY BAD...)

That could be an issue. I will check whether I can find a workaround;
if not, ClamAV may not be an option.
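
For anyone following the thread, here is a minimal sketch of the anchoring choice the quoted script comments on ((?:^| ) instead of \b). It uses only core Perl and a plain join in place of Regexp::Assemble. The two sample names and the variable names are made up for illustration, and the quotemeta step is my own addition (the quoted script does not escape its input), so treat it as an assumption rather than part of the original method.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample names (not from the original name pack).
# quotemeta escapes regex metacharacters such as the "." in "st. john"
# so that each name is matched literally.
my @names = ('kate smith', 'st. john');
my $alt = join '|', map { quotemeta($_) } @names;

# Variant 1: word boundaries. "-" is a non-word character, so \b also
# matches between "Mary-" and "Kate".
our $word_boundary = qr/\b(?:$alt)\b/i;

# Variant 2: start/end of line or a literal space, as in the quoted rule.
our $space_anchor = qr/(?:^| )(?:$alt)(?:$| )/i;

for my $line ('Kate Smith called', 'Mary-Kate Smith called') {
    printf "%-24s \\b:%s  space:%s\n", $line,
        ($line =~ $word_boundary ? 'hit' : 'miss'),
        ($line =~ $space_anchor  ? 'hit' : 'miss');
}
```

With these inputs, "Mary-Kate Smith called" hits the \b variant but not the space-anchored one, which is the false positive the comment in the quoted script is avoiding. The trade-off is that the space-anchored form will not match a name that is directly followed by punctuation such as a comma or full stop.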

