spamassassin-dev mailing list archives

From (Justin Mason)
Subject Re: 3.0.5 rescoring
Date Fri, 02 Dec 2005 20:59:57 GMT

Michael Monnerie writes:
> On Thursday, 1 December 2005 19:21 Justin Mason wrote:
> > I think if we limit each corpus to a certain max percentage of the
> > total, we could do this -- e.g. if a corpus makes up more than (100 /
> > num_contributors)%, then any excess above that percentage is dropped,
> You're not telling me that your 700k messages are hand-sorted? How old are 
> you ;-)

They really are ;)  Basically I almost never delete ham; instead I have 3
keys remapped:

    d => "delete"
    s => "delete as ham"
    a => "delete as spam"

The latter two file the messages into the corpus, instead of removing them.
There's no difference in UI terms, so it's very easy to maintain a
hand-sorted corpus this way.
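
A minimal sketch of what the "delete as ham" / "delete as spam" actions
could do behind the key binding, written in Python. The function name
file_into_corpus, the one-file-per-message layout, and the corpus_root
directory are all assumptions for illustration; this is not how any
particular MUA implements it.

```python
import shutil
from pathlib import Path


def file_into_corpus(message_path, label, corpus_root):
    """Move a message out of the inbox into corpus_root/<label>/.

    To the user the message disappears, exactly like a delete, but it is
    kept as a hand-classified sample. Hypothetical helper: it assumes one
    file per message and does not handle file name collisions.
    """
    if label not in ("ham", "spam"):
        raise ValueError("label must be 'ham' or 'spam'")
    target_dir = Path(corpus_root) / label
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(message_path).name
    shutil.move(str(message_path), str(target))
    return target
```

Binding "s" to file_into_corpus(msg, "ham", root) and "a" to
file_into_corpus(msg, "spam", root) would give the behaviour described
above, with "d" left as a plain delete.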

More MUAs should support this ;)   I requested it in KMail, but I don't
think it ever happened.

> Anyway, more contributors would help with the problem. Imagine you get 100 
> contributors, each with just 2000 messages. And I believe there are a lot of 
> people out there who already have a bigger corpus. Making it easier 
> to contribute (and encouraging people to report) could help.

Suggestions are welcome. We currently have two ways:

    1. rsync or "svn update" nightly, and run mass-checks on your
    mail corpora locally

    2. rsync up the corpora and let the buildbot do it.

There may be other, more effective ways to make this easier...
it'd be great to work one out.

> If your two corpora are so big, I guess setting a time limit to only take 
> the last 180 days or so of all SPAM could reduce your outsized share of the 
> percentages.

Yep.   I think we can do something like this (and should).
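
The two limits discussed in this thread can be sketched roughly as
follows, in Python. This is an illustration under stated assumptions,
not the rescoring code: the names drop_old and cap_corpora are
hypothetical, messages are assumed to be (timestamp, id) pairs, and the
(100 / num_contributors)% cap is assumed to be measured against the
original total before anything is dropped.

```python
from datetime import datetime, timedelta, timezone


def drop_old(messages, now, max_age_days=180):
    """Keep only (timestamp, msg_id) pairs no older than max_age_days.

    Hypothetical helper; the thread suggests applying this to spam only,
    so the caller decides which list to pass in.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [m for m in messages if m[0] >= cutoff]


def cap_corpora(sizes):
    """Cap each corpus at (100 / num_contributors)% of the original total.

    sizes maps contributor name -> message count. Returns the number of
    messages kept per contributor; anything above the cap is dropped.
    Assumption: the percentage refers to the total before capping.
    """
    if not sizes:
        return {}
    limit = sum(sizes.values()) // len(sizes)
    return {name: min(count, limit) for name, count in sizes.items()}
```

Note that under this reading a very large corpus can still dominate
after a single pass, since the cap is computed from a total that
includes its own excess; the cap could be reapplied, or combined with
the age limit, if that matters.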

--j.