www-infrastructure-dev mailing list archives

From sebb <seb...@gmail.com>
Subject Re: Mail Archive Spam Cleanup
Date Wed, 01 Apr 2009 12:57:25 GMT
On 01/04/2009, Paul Querna <paul@querna.org> wrote:
> Hi,
>  First, this is kinda a call for a volunteer, I doubt I'll have time to
>  really get into fixing this any time soon.
>  I have access to the webmaster tools for mail-archives.apache.org, and
>  one of the most disturbing things is that for our top 20 queries by
>  traffic, none of them have to do with software at the ASF.
>  19 are porn related, and 1 is about hacking gmail.
>  It would be nice to clean this up, as mail moderation is never
>  perfect, but keeping these messages up on mail-archives.apache.org
>  forever is less than ideal.

And subscribed/allowed users may generate spam, e.g. if their system
is compromised.

>  Part of the problem is that there is not an easy way to remove things
>  from the mail archives, as they are just mbox files on disk.
>  I have two ideas on how we could solve this:
>   1) Add a feature to mod_mbox, a Message ID Blacklist file, if the
>  message-id is contained in the blacklist, it just 404s on the site
>  like it wasn't there.  This means many parts of mod_mbox need to be
>  modified to check this blacklist, and anyone who rsync's our mbox
>  files will still get the spam.  This however is likely the easiest to
>  manage, as we could make a tiny webapp for adding message IDs to the
>  blacklist.  This is also an easily reversible step, so if we
>  accidentally blacklist something, it's easy to fix.
>   2) Edit the raw mbox files themselves.  This is the hardest, as the
>  mbox files in archived format are gzip-compressed, so any tool
>  would need to uncompress, edit, recompress....  Maybe a command line
>  (python?) tool, that we could run from people.apache.org, but it would
>  be much harder to make web based.  It does have the advantage that all
>  3rd parties would get edited archives, but I doubt many will re-read
>  the edited files.  Without care on how this is done however, this is
>  literally destroying data.
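
For illustration, the lookup behind idea 1 might look roughly like the sketch below. It is written in Python only to show the logic; mod_mbox itself is an Apache module, so a real change would live there. The file format (one Message-ID per line, `#` for comments) and the names `load_blacklist` and `status_for` are assumptions, not anything that exists in mod_mbox.

```python
def load_blacklist(path):
    """Read a hypothetical blacklist file: one Message-ID per line.

    Blank lines and lines starting with '#' are skipped.
    """
    ids = set()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            entry = line.strip()
            if entry and not entry.startswith("#"):
                ids.add(entry)
    return ids


def status_for(message_id, blacklist):
    """Return the HTTP status the archive would serve for this message.

    Blacklisted messages get 404, as if they were never archived.
    """
    return 404 if message_id.strip() in blacklist else 200
```

Reversing a mistaken entry is then just a matter of deleting its line from the file, which is what makes this option easy to undo.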

I suggest that the tool should split the files into ham and spam; no
need to delete the spam, as very little extra space will be used
compared with keeping the original. This would allow recovery of
messages if required.

The tool could also use message ids to drive the split.
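
A minimal sketch of such a split, assuming Python and its standard `mailbox` and `gzip` modules. The function name `split_mbox` and the output paths are hypothetical, and the set of spam Message-IDs is assumed to come from elsewhere (for example, a list maintained by hand or by a web-app). The original archive is only read, never modified.

```python
import gzip
import mailbox
import os
import shutil
import tempfile


def split_mbox(archive_gz, spam_ids, ham_path, spam_path):
    """Split a gzip-compressed mbox into separate ham and spam mbox files.

    spam_ids is a set of Message-ID header values marked as spam.
    Returns (ham_count, spam_count).
    """
    # mailbox.mbox works on a file path, so decompress to a temporary file.
    with tempfile.NamedTemporaryFile(suffix=".mbox", delete=False) as tmp:
        with gzip.open(archive_gz, "rb") as src:
            shutil.copyfileobj(src, tmp)
        tmp_path = tmp.name

    ham_count = spam_count = 0
    source = mailbox.mbox(tmp_path, create=False)
    ham = mailbox.mbox(ham_path)
    spam = mailbox.mbox(spam_path)
    try:
        for message in source:
            msg_id = str(message.get("Message-ID", "")).strip()
            if msg_id in spam_ids:
                spam.add(message)
                spam_count += 1
            else:
                ham.add(message)
                ham_count += 1
        ham.flush()
        spam.flush()
    finally:
        source.close()
        ham.close()
        spam.close()
        os.remove(tmp_path)
    return ham_count, spam_count
```

Because both output files are kept, a message that was marked as spam by mistake can be recovered by removing its ID from the set and re-running the split against the untouched original.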

Again, it would be nice if there was a web-app that one could use to
browse messages and mark them as spam.

Keeping ham and spam might be useful in providing data for the
SpamAssassin project?
But maybe they have already processed all the ASF mail.

>  Once we have the tool, I think the task of actually cleaning it up
>  becomes much easier, and one we can do in a reactive mode.
>  Thoughts?
>  Thanks,
>  Paul
