www-infrastructure-dev mailing list archives

From Paul Querna <p...@querna.org>
Subject Mail Archive Spam Cleanup
Date Wed, 01 Apr 2009 10:32:01 GMT

First, this is kind of a call for a volunteer; I doubt I'll have time
to really get into fixing this any time soon.

I have access to the webmaster tools for mail-archives.apache.org, and
one of the most disturbing things is that for our top 20 queries by
traffic, none of them have to do with software at the ASF.

19 are porn related, and 1 is about hacking gmail.

It would be nice to clean this up. Mail moderation is never perfect,
but keeping the spam up on mail-archives.apache.org forever is less
than ideal.

Part of the problem is that there is not an easy way to remove things
from the mail archives, as they are just mbox files on disk.

I have two ideas on how we could solve this:
  1) Add a feature to mod_mbox: a Message-ID blacklist file.  If a
message's Message-ID is in the blacklist, the site just returns a 404
as if the message wasn't there.  This means many parts of mod_mbox
need to be modified to check the blacklist, and anyone who rsyncs our
mbox files will still get the spam.  However, this is likely the
easiest to manage, as we could make a tiny webapp for adding message
IDs to the blacklist.  It is also an easily reversible step, so if we
accidentally blacklist something, it's easy to fix.

  2) Edit the raw mbox files themselves.  This is the hardest, as the
archived mbox files are gzip-compressed, so any tool would need to
uncompress, edit, and recompress them.  Maybe a command line (python?)
tool that we could run from people.apache.org, but it would be much
harder to make web based.  It does have the advantage that all 3rd
parties would get edited archives, but I doubt many will re-read the
edited files.  Without care in how this is done, however, this is
literally destroying data.
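
As a rough illustration of how the two ideas could fit together, here
is a minimal Python sketch: a blacklist file of Message-IDs (idea 1)
drives a command-line style filter over a gzipped mbox (idea 2).  The
function names, the one-ID-per-line blacklist format, and the file
names are all hypothetical; nothing here is taken from mod_mbox.  It
assumes body lines starting with "From " are already escaped, as is
usual in mbox files, and it writes to a new file instead of editing in
place, so the original archive is never destroyed.

```python
import gzip
from email import message_from_bytes


def load_blacklist(path):
    """Read one Message-ID per line; skip blank lines and '#' comments."""
    ids = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                ids.add(line)
    return ids


def split_mbox(data):
    """Yield raw messages, each starting at its 'From ' separator line.

    Assumes 'From ' at the start of a line only ever marks a new message.
    """
    current = []
    for line in data.splitlines(keepends=True):
        if line.startswith(b"From ") and current:
            yield b"".join(current)
            current = []
        current.append(line)
    if current:
        yield b"".join(current)


def filter_mbox_gz(src, dst, blacklist):
    """Copy src to dst, skipping blacklisted messages.

    Returns the list of Message-IDs that were left out of dst.
    """
    with gzip.open(src, "rb") as fh:
        data = fh.read()
    removed = []
    with gzip.open(dst, "wb") as out:
        for raw in split_mbox(data):
            # Drop the "From " separator line before parsing the headers.
            _, _, rest = raw.partition(b"\n")
            msg_id = str(message_from_bytes(rest).get("Message-ID", "")).strip()
            if msg_id in blacklist:
                removed.append(msg_id)
            else:
                out.write(raw)  # kept messages are copied byte for byte
    return removed
```

A run might look like filter_mbox_gz("200904.gz", "200904.clean.gz",
load_blacklist("blacklist.txt")).  Keeping the original file around
until the cleaned copy has been checked makes the step reversible, and
logging the returned IDs gives a record of what was removed.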

Once we have the tool, I think the task of actually cleaning things up
becomes much easier, and something we can do on a reactive basis.



