spamassassin-dev mailing list archives

From Robert Menschel <Rob...@Menschel.net>
Subject Re[2]: NOTICE: 3.1.0 rescoring mass-checks
Date Sat, 02 Jul 2005 04:07:13 GMT
Hello Nix,

Friday, July 1, 2005, 5:00:00 PM, you wrote:

N> On Thu, 30 Jun 2005, Theo Van Dinter spake:
>> 18 months ago would be Jan 1 2004, not 2003.  We also usually limit to
>> 6 months, not 18, but ...

N> Six months isn't much for ham at all, is it? That would only give me a
N> thousand or so hams, and more than a hundred times as much spam as ham.

N> This seems a little... unbalanced. Ham doesn't change *that* fast.

N> (Maybe I should suck a few mailing lists into the ham, but I'm chary of
N> that because many of those lists may also be being used by others as
N> ham sources, so it may lead to duplication.)

I'm in the fortunate position that my corpus pulls in 20k ham and 20k
spam each week, so this isn't a concern for me at the moment.

However, previously my pattern was like yours, and when I would
mass-check, I'd mass-check on two years' ham vs 3 months' spam.

Since I wasn't mass-checking Bayes, all I did was one mass-check run
specifying only my ham corpus, and then a second mass-check run
specifying only my spam corpus.  I then combined them for the
frequency analysis.

It should be feasible to modify the rescoring mass-check instructions
so you do something like:
a) initialize the mass-check (including removing any prior Bayes
database)
b) split your ham corpus (1-2 years) into 10 equal parts. Split your
spam corpus (2-6 months) into 10 equal parts.
c) Cycle through your 20 corpus files, running mass-check on each:
oldest ham, oldest spam, next oldest ham, next oldest spam, etc.
d) Combine all ham logs into one, combine all spam logs into one.

It's not optimal, in that Bayes will be trained on emails out of time
sequence, but it should shuffle them enough to get useful results out
of it, IMO.
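
As a rough illustration of steps b) and c), here is a minimal Python
sketch that splits two date-sorted message lists into 10 near-equal
parts each and produces the alternating run order. It assumes the
lists are already sorted oldest first. The function names and the
sample file names are hypothetical, and the sketch only computes the
order: running mass-check on each part and combining the logs are left
out because they depend on the local setup.

```python
from itertools import chain


def split_into_parts(messages, parts=10):
    """Split a list (sorted oldest first) into `parts` contiguous,
    near-equal chunks; the earliest chunks absorb any remainder."""
    base, extra = divmod(len(messages), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(messages[start:end])
        start = end
    return chunks


def interleave_schedule(ham, spam, parts=10):
    """Return [("ham", chunk), ("spam", chunk), ...] ordered as
    oldest ham, oldest spam, next oldest ham, next oldest spam, ..."""
    ham_chunks = split_into_parts(ham, parts)
    spam_chunks = split_into_parts(spam, parts)
    return list(chain.from_iterable(
        (("ham", h), ("spam", s)) for h, s in zip(ham_chunks, spam_chunks)))


# Hypothetical file names standing in for a real, date-sorted corpus.
ham = [f"ham-{i:03d}" for i in range(25)]
spam = [f"spam-{i:03d}" for i in range(12)]

schedule = interleave_schedule(ham, spam)
for label, chunk in schedule:
    print(label, len(chunk))
```

Each entry in `schedule` corresponds to one mass-check run in step c);
the label tells you which combined log (ham or spam) that run's output
belongs to in step d).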

Bob Menschel



