www-infrastructure-dev mailing list archives

From chris <ch...@ia.gov>
Subject Re: Mail Archive Spam Cleanup
Date Thu, 02 Apr 2009 21:20:04 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>  2) Edit the raw mbox files themselves.  This is the hardest, as the
> mbox files in archived format are gzip compressed, so any tool
> would need to uncompress, edit, recompress....  Maybe a command line
> (python?) tool, that we could run from people.apache.org, but it would
> be much harder to make web based.  It does have the advantage that all
> 3rd parties would get edited archives, but I doubt many will re-read
> the edited files.  Without care on how this is done however, this is
> literally destroying data.

Hi Paul,

I have something similar to your option 2) above running right now on a
local copy of the mail archives to see how it does.  So far I am
encouraged by the results, but it is slow going even with multiple
threads.  I seem to have locking issues with SA if I go over 5 threads
on my machine.  When it is done I will have cleaner copies of the
archive files that had spam, a list of pruned Message-IDs and the mbox
each came from, and an mbox URL for each Message-ID for quick viewing
in a browser.
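
For anyone who wants to try the uncompress, prune, recompress step
without my script, here is a minimal Python sketch of the idea.  It is
not the Perl I am running: the function name prune_gzipped_mbox and the
is_spam callback are hypothetical, and the spam decision itself is left
to the caller.  It writes a new gzipped file rather than touching the
original, so nothing is destroyed if the classifier is wrong.

```python
import gzip
import mailbox
import os
import shutil
import tempfile


def prune_gzipped_mbox(src_gz, dst_gz, is_spam):
    """Copy src_gz to dst_gz, leaving out messages where is_spam(msg) is true.

    Returns the Message-IDs of the pruned messages. The source file is
    only read, never modified. `is_spam` is a caller-supplied predicate
    (hypothetical; plug in whatever scoring you trust).
    """
    pruned = []
    with tempfile.TemporaryDirectory() as tmp:
        plain_in = os.path.join(tmp, "in.mbox")
        plain_out = os.path.join(tmp, "out.mbox")

        # The mailbox module needs a real uncompressed file on disk.
        with gzip.open(src_gz, "rb") as fin, open(plain_in, "wb") as fout:
            shutil.copyfileobj(fin, fout)

        source = mailbox.mbox(plain_in, create=False)
        kept = mailbox.mbox(plain_out)
        try:
            for msg in source:
                if is_spam(msg):
                    pruned.append(msg.get("Message-ID", "(no message-id)"))
                else:
                    kept.add(msg)
            kept.flush()
        finally:
            kept.close()
            source.close()

        # Recompress the cleaned copy to its final location.
        with open(plain_out, "rb") as fin, gzip.open(dst_gz, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    return pruned
```

The returned list is what you would log next to the mbox name, so that
every removal can be reviewed and reversed from the untouched original.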

I'm using a bit of Perl that calls the Perl SpamAssassin hooks to parse
the mailbox and score it.  I'm not very experienced with SA, so I'd be
open to suggestions as to what good settings would be for performing
these kinds of tests on large archives like the ASF's.
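
As a rough stand-in for the scoring step, the sketch below pipes one
raw message through the spamc client in check mode instead of using the
Perl API.  This is a swapped-in technique, not what my script does: it
assumes a spamd daemon is already running and that spamc is on the
PATH, and the helper names parse_spamc_check and score_message are
made up for illustration.

```python
import subprocess


def parse_spamc_check(output):
    """Parse the 'score/threshold' line printed by `spamc -c`."""
    score_text, threshold_text = output.strip().split("/", 1)
    return float(score_text), float(threshold_text)


def score_message(raw_message):
    """Score one raw message (bytes); returns (score, threshold).

    Assumes spamd is running and spamc is installed (an assumption,
    not something stated in this thread).
    """
    # spamc -c exits with status 1 when the message is spam, so a
    # non-zero exit status is expected and is not treated as an error.
    result = subprocess.run(
        ["spamc", "-c"], input=raw_message, stdout=subprocess.PIPE
    )
    return parse_spamc_check(result.stdout.decode("ascii"))
```

A message would then count as spam when the returned score is at or
above the returned threshold, which is the required_score in effect on
the spamd side.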

My SA user_prefs are as follows, and I am catching a fair bit of spam.
Casual inspection has not revealed much of a problem with false
positives, though I only reviewed a small sample.

required_score          5.0
report_safe             0
use_bayes               1
bayes_auto_learn        1
skip_rbl_checks         1
use_razor2              0
use_dcc                 0
use_pyzor               0
ok_languages            all
ok_locales              all
trusted_networks        140.211.11.0/24
score CTYPE_8SPACE_GIF 0
score TVD_FW_GRAPHIC_NAME_LONG 0
score TVD_FW_GRAPHIC_NAME_MID 0
lock_method flock


After my full run is done I will put the results up, including the
cleaned and gzipped files as well as the script I am using to do it
all.

good day!
crr/arreyder



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknVLAQACgkQPmaZdRmQd+aPEgCfajBc08ccymrj5rBQ4FpaStP3
MtAAmgLwcYZzivaxbtVlSIEB7CDdqood
=tpOV
-----END PGP SIGNATURE-----
