www-infrastructure-dev mailing list archives

From Justin Mason ...@jmason.org>
Subject Re: Mail Archive Spam Cleanup
Date Fri, 03 Apr 2009 08:37:08 GMT
hi -- I suggest setting "use_auto_whitelist" to 0.  it wouldn't be of
much use in this case, and it requires file locking too.  also, if you
want to avoid locking slowdowns, turn off bayes autolearning... it
probably isn't helping enough to justify the slowdown it causes.

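As a rough sketch, and assuming everything else in the user_prefs quoted
below stays as posted, those two suggestions would come down to these
two lines:

```
use_auto_whitelist      0
bayes_auto_learn        0
```

"use_bayes 1" can stay as it is; this only stops SA from learning from
each message it scores during the run.
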
Another idea: use SA's "mass-check" tool:

it is nicely parallelized....


On Thu, Apr 2, 2009 at 21:20, chris <chris@ia.gov> wrote:
>>  2) Edit the raw mbox files themselves.  This is the hardest, as the
>> mbox files in archived format are gzip-compressed, so any tool
>> would need to uncompress, edit, recompress....  Maybe a command line
>> (python?) tool, that we could run from people.apache.org, but it would
>> be much harder to make web based.  It does have the advantage that all
>> 3rd parties would get edited archives, but I doubt many will re-read
>> the edited files.  Without care on how this is done however, this is
>> literally destroying data.
> Hi Paul,
> I have something similar to your 2) option above running right now on a local copy of
> the mail archives to see how it does.  So far I am encouraged by the results, but it is
> slow going even with multiple threads.  I seem to be having locking issues with SA if I
> go over 5 threads on my machine.  When it is done I will have cleaner copies of the
> archive files that had spam, a list of pruned message-ids and the mbox they came from,
> and an mbox URL to the message-ID for quick viewing in a browser.
> I'm using a bit of perl that calls the perl Spam Assassin hooks to parse the mailbox
> and score it.  I'm not very experienced with SA, so I'd be open to suggestions as to
> what good settings would be for performing these kinds of tests on large archives like
> the ASF's.
> My SA user_prefs are as follows and I am catching a fair bit of spam.  Casual inspection
> has not revealed much of a problem with false positives, though I only reviewed a small
> sample.
> required_score           5.0
> report_safe             0
> use_bayes               1
> bayes_auto_learn        1
> skip_rbl_checks         1
> use_razor2              0
> use_dcc                 0
> use_pyzor               0
> ok_languages            all
> ok_locales              all
> trusted_networks
> score CTYPE_8SPACE_GIF 0
> lock_method flock
> After my full run is done I will put the results up including the cleaned and gzip'ed
> files as well as the script I am using to do it all.
> good day!
> crr/arreyder
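
For illustration only, the uncompress / edit / recompress step from
option 2) could be sketched in Python roughly as follows.  This is not
the perl script described above.  The function name
prune_gzipped_mbox and the is_spam callback are hypothetical: is_spam
stands in for whatever scorer is used (for example a wrapper around an
SA score threshold), which is not shown here.  The sketch writes to a
new file and never touches the original archive, to avoid the
"destroying data" problem.

```python
import gzip
import mailbox
import os
import shutil
import tempfile


def prune_gzipped_mbox(src_gz, dst_gz, is_spam):
    """Copy src_gz to dst_gz, skipping messages for which is_spam(msg) is true.

    Returns the Message-ID headers of the skipped messages.
    src_gz is only read, never modified.
    """
    pruned_ids = []
    with tempfile.TemporaryDirectory() as tmp:
        raw_path = os.path.join(tmp, "in.mbox")
        out_path = os.path.join(tmp, "out.mbox")

        # 1. Uncompress to a plain file, since mailbox.mbox works on a path.
        with gzip.open(src_gz, "rb") as fin, open(raw_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)

        # 2. Copy every message that is not flagged into a fresh mbox.
        src = mailbox.mbox(raw_path, create=False)
        dst = mailbox.mbox(out_path, create=True)
        try:
            for msg in src:
                if is_spam(msg):
                    pruned_ids.append(msg.get("Message-ID", "(no Message-ID)"))
                else:
                    dst.add(msg)
            dst.flush()
        finally:
            src.close()
            dst.close()

        # 3. Recompress the cleaned copy to the destination path.
        with open(out_path, "rb") as fin, gzip.open(dst_gz, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    return pruned_ids
```

The returned list gives the pruned message-ids per mbox, so they can be
reviewed before any cleaned file replaces a published one.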
