spamassassin-users mailing list archives

From Arthur Dent <sa.l...@troodos.demon.co.uk>
Subject sa-learn weirdness...
Date Wed, 06 Feb 2008 10:48:03 GMT
Well, in fairness, it's probably not sa-learn that's causing the weirdness but
my setup. I don't understand what's causing the problem, however. Allow me to
explain...

I have a nightly cron job that runs a script to do sa-learning. Learning spam
is no problem: it's all in one mail folder (two actually, but the details are
irrelevant), which contains roughly 4,000 spam mails.

Ham is more of a problem because procmail sorts my mail into several different
(mbox) folders and I manually file incoming mail into many others. What my script
does is concatenate all these various folders into one "TempHam" folder which
is then used for sa-learn and is then deleted.
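
For anyone wanting to picture the routine, here is a minimal sketch of it in
Python. It is not my actual script: the function names and the injectable
`run` parameter are made up for illustration, and the only real command it
relies on is `sa-learn --ham --mbox <file>`.

```python
import shutil
import subprocess
from pathlib import Path


def build_temp_ham(folders, temp_path):
    """Concatenate several mbox files into one temporary mbox file."""
    temp_path = Path(temp_path)
    with temp_path.open("wb") as out:
        for folder in folders:
            with Path(folder).open("rb") as src:
                shutil.copyfileobj(src, out)
                # Make sure the next folder's "From " separator starts on
                # its own line even if this file lacks a trailing newline.
                if src.tell() > 0:
                    src.seek(-1, 2)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
    return temp_path


def sa_learn_ham_command(mbox_path):
    """Build the sa-learn invocation for a single mbox file of ham."""
    return ["sa-learn", "--ham", "--mbox", str(mbox_path)]


def learn_ham(folders, temp_path, run=subprocess.run):
    """Build TempHam, feed it to sa-learn, then delete it again."""
    temp = build_temp_ham(folders, temp_path)
    try:
        return run(sa_learn_ham_command(temp), check=True)
    finally:
        temp.unlink()
```

A nightly job would call something like `learn_ham(list_of_mbox_paths,
"TempHam")`; the folder list itself is whatever the cat routine covers.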

I recently had a tidy-up and reorganisation of my folders and arranged a
hierarchical folder system. In so doing I realised that for many months
(years?) I had actually been leaving many of the busier folders (e.g. the one
where I file all these spamassassin mailing list emails) out of the cat
routine. This was my opportunity to fix this.

I therefore expected a one-off large spike in sa-learn time for ham messages
on the first night the new job ran (and sure enough the ham learn job went
from c. 10 minutes to 1 hour 24 minutes, causing an overlap with backup
routines etc.).

I was, however, surprised when the same thing happened the next night (and the
next...).

Below I list the output from the last few nights (ham only). The first entry is
the last run under the previous system.*

Learned tokens from 8 message(s) (3165 message(s) examined)
Learned tokens from 4628 message(s) (8703 message(s) examined)
Learned tokens from 3890 message(s) (8634 message(s) examined)
Learned tokens from 2264 message(s) (8671 message(s) examined)
Learned tokens from 2303 message(s) (8620 message(s) examined)

Notice that although the number of messages being learned from seems to be
coming down gradually, the running total across these nights far exceeds the
total number of ham mails in the corpus.
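
To make that arithmetic concrete, here is a small sketch that parses the
sa-learn lines quoted above and adds up the "learned" counts for the four
runs under the new system. The regular expression and variable names are
only illustrative; the numbers are the ones from my output.

```python
import re

# Matches lines such as:
#   Learned tokens from 8 message(s) (3165 message(s) examined)
LINE = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)

OUTPUT = """\
Learned tokens from 8 message(s) (3165 message(s) examined)
Learned tokens from 4628 message(s) (8703 message(s) examined)
Learned tokens from 3890 message(s) (8634 message(s) examined)
Learned tokens from 2264 message(s) (8671 message(s) examined)
Learned tokens from 2303 message(s) (8620 message(s) examined)
"""


def parse_runs(text):
    """Return a list of (learned, examined) pairs, one per sa-learn line."""
    return [(int(m.group(1)), int(m.group(2))) for m in LINE.finditer(text)]


runs = parse_runs(OUTPUT)
new_runs = runs[1:]  # skip the last run under the previous system

total_learned = sum(learned for learned, _ in new_runs)
largest_corpus = max(examined for _, examined in new_runs)

print(total_learned, largest_corpus)  # prints "13085 8703"
```

So in four nights sa-learn reports learning from 13,085 messages, while it
never examined more than 8,703 in any single run.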

Is this normal?
Will it eventually settle down?

Thanks in advance for any advice or suggestions...

Mark


* Note: this is my home system. Mails >180 days old are archived out of the
folders using archivemail. I probably get c. 40-50 non-spam mails per day,
which are kept in the various folders.
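
As a rough sanity check on those figures, the sketch below estimates how much
ham the folders should hold. It assumes, purely for illustration, a steady
arrival rate and that every non-spam mail stays in the folders for the full
180 days; the constant names are made up.

```python
RETENTION_DAYS = 180          # archivemail cutoff mentioned above
MAILS_PER_DAY = (40, 50)      # rough daily non-spam volume, low and high

# Steady-state corpus size if every mail is kept for the full window.
corpus_low = MAILS_PER_DAY[0] * RETENTION_DAYS
corpus_high = MAILS_PER_DAY[1] * RETENTION_DAYS

print(corpus_low, corpus_high)  # prints "7200 9000"
```

The 8,620 to 8,703 messages examined each night sit inside that range, so the
size of the corpus looks about right. By the same figures only 40-50 messages
a night are new, which is why thousands being learned from each night looks
odd to me.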


