www-community mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: Announcing Erathostenes 1.0
Date Sun, 18 Apr 2004 13:55:48 GMT
Andrew Savory wrote:

> Hi,
> 
> On 17 Apr 2004, at 18:59, Stefano Mazzocchi wrote:
> 
>> Find out how this works here:
>>
>>  http://www.betaversion.org/~stefano/software/erathostenes/index.html
> 
> 
> Interesting! But when you say "the assumption is that you *never* delete 
> anything" ... do you mean in perpetuity? How realistic do you think this 
> is, given the ~40kb payload of most virus mails these days? Over the 
> last 6 months, I've accumulated over a gigabyte of such mail ... that's 
> a pretty high cost in disk space!
> 
> Or do you discard after retraining?

In this version of the script, if you remove an email from the spam 
folder, the script untrains the spam database on that email. This lets 
you stop false positives from escalating (if you can still spot them!).
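
To make "untrained" concrete, here is a minimal sketch of what training 
and untraining could look like on a simple token-count store. This is 
not the actual Erathostenes code: the TokenDB class, its method names, 
and the whitespace tokenizer are assumptions made for illustration only.

```python
from collections import Counter


class TokenDB:
    """Hypothetical spam token store (illustration, not the real database)."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spam mails containing it
        self.spam_messages = 0        # number of mails trained as spam

    @staticmethod
    def tokens(text):
        # Naive tokenizer (an assumption): lowercase, split on whitespace,
        # count each token at most once per message.
        return set(text.lower().split())

    def train_spam(self, text):
        self.spam_counts.update(self.tokens(text))
        self.spam_messages += 1

    def untrain_spam(self, text):
        # Exact inverse of train_spam; it needs the original message text.
        for tok in self.tokens(text):
            self.spam_counts[tok] -= 1
            if self.spam_counts[tok] <= 0:
                del self.spam_counts[tok]
        self.spam_messages -= 1


db = TokenDB()
db.train_spam("Cheap pills cheap")
db.untrain_spam("Cheap pills cheap")  # leaves the store empty again
```

Note that untraining is only exact if you still have the text that was 
trained on.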

The previous incarnation of the script did not do this. The problem was 
that a false positive (or, much more frequently, ham that you moved into 
the spam folder by mistake and didn't notice before cron called the 
trainer) would "pollute" the database.

To be able to "undo" training (this is not frequent, but it's a nice 
feature) you need to keep everything at all times. In fact, the way the 
script works today is that it makes a local copy of each email on your 
server, so not only do you keep everything, you also have two copies of 
it.
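
A rough sketch of how a local copy can drive the "undo": compare the 
copy with the current contents of the spam folder, and whatever exists 
only in the copy was removed by the user and is a candidate for 
untraining. The one-file-per-message layout with matching file names and 
the function name are assumptions for illustration, not a description of 
how the script actually stores mail.

```python
from pathlib import Path


def removed_from_folder(spam_folder: Path, local_copy: Path) -> list[Path]:
    """Return local copies whose original is no longer in the spam folder.

    Assumes (hypothetically) one file per message, same name in both places.
    """
    still_present = {p.name for p in spam_folder.iterdir() if p.is_file()}
    return sorted(
        p for p in local_copy.iterdir()
        if p.is_file() and p.name not in still_present
    )
```

Each path returned still holds the full message text, which is exactly 
what an untraining step needs.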

Note that it is entirely possible, in case your disk space is limited, 
to modify the script to remove binary attachments from email.
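
As a sketch of that modification, using Python's standard email package 
(not necessarily what the script itself is written in): keep the text 
parts of a message and drop everything else. Treating every non-text 
part as a "binary attachment" is an assumption, and the function name is 
made up for this example.

```python
from email import message_from_bytes, policy


def _strip(part):
    # Recursively keep multipart containers and text/* leaves only.
    if part.is_multipart():
        kept = [
            _strip(sub) for sub in part.get_payload()
            if sub.is_multipart() or sub.get_content_maintype() == "text"
        ]
        part.set_payload(kept)
    return part


def strip_binary_attachments(raw: bytes) -> bytes:
    """Return the message with non-text parts removed (illustrative only)."""
    msg = message_from_bytes(raw, policy=policy.default)
    if not msg.is_multipart():
        return raw  # single-part mail: nothing to strip in this sketch
    return _strip(msg).as_bytes()
```

Headers and text bodies survive, which is what a token-based trainer 
mostly looks at, so the saved copy stays usable for untraining.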

Anyway, in my case, I have 320 MB of spam from the last 6 months. Disk 
space is not that big a deal these days, especially on servers.

-- 
Stefano.

