Subject Re: shifting the midpoint between the average spam and average ham scores back to 5.0
Date Fri, 03 Sep 2004 16:38:50 GMT
Joe Flowers wrote to users@spamassassin.apache.org:

>
> If the average spam score of all of my ham messages is 1.0 and the average
> spam score of all of my spam messages is 3.0, then what is the best way to
> move the average_of_ these_two_averages (2.0) back up to 5.0?
>
> The result being that I need my current average score for ham messages to be
> "4" and my current average score for spam messages to be "6". And, I need to
> do this without screwing up the relative statistics of spamassassin.

Hmm... After reading this thread, I think you *do* have a good question,
here, and that you did already get some good answers, but I'd like to

You make a valid point in that, if graphed separately, ham and spam
should show up as two separate curves. However, there *is* overlap, and
neither the ham scores nor the spam scores (separately or together) are
normally distributed. They don't have to be for you to calculate the
mean of the means, but if you put your threshold there, you're going to
get a great many false positives.

What you really should do is decide how many false positives you (and
your users) can live with. For us, it's 1/2000 (0.05%, one twentieth of
a percent). For this, you don't even need a spam corpus. Just collect a
good ham corpus (to get 0.05%, you need at least 2000 ham) and look at
the SA scores. Choose your threshold (or your constant modifier) so that
fewer than 1 in 2000 ham messages hit it, and re-check regularly.
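
As a rough illustration of that procedure, here is a minimal Python sketch. It assumes you have already extracted the SA score of every message in your ham corpus into a plain list of numbers; the function names and the 0.1 step size are my own choices for the example, not anything SpamAssassin provides. It also assumes a message is tagged when its score is at or above the threshold.

```python
import math
from fractions import Fraction


def false_positive_rate(ham_scores, threshold):
    """Fraction of ham messages that score at or above the threshold."""
    flagged = sum(1 for score in ham_scores if score >= threshold)
    return Fraction(flagged, len(ham_scores))


def lowest_safe_threshold(ham_scores, max_fp_rate=Fraction(1, 2000)):
    """Lowest threshold, in steps of 0.1, whose FP rate is below max_fp_rate."""
    if not ham_scores:
        raise ValueError("need a non-empty ham corpus")
    # Work in integer tenths so the step does not accumulate float error.
    tenths = math.floor(min(ham_scores) * 10)
    while false_positive_rate(ham_scores, tenths / 10) >= max_fp_rate:
        tenths += 1
    return tenths / 10
```

With exactly 2000 ham messages, "fewer than 1 in 2000" means no ham at all may reach the threshold, so the result lands just above the highest ham score. If you would rather keep the threshold at 5.0 and shift the scores instead, the constant modifier would be 5.0 minus the value this returns.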

You can cross-check this with a spam corpus, if you want to balance FPs
against FNs (if you're well below your maximum FP ratio, you have some
room to play).
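
To make that cross-check concrete, here is a small Python sketch that reports both error rates for a candidate threshold. It assumes two plain lists of SA scores, one per corpus, and that a message is tagged when its score is at or above the threshold; the function name is hypothetical.

```python
def error_rates(ham_scores, spam_scores, threshold):
    """Return (false positive rate, false negative rate) at a threshold."""
    # False positive: ham that scores at or above the threshold.
    false_positives = sum(1 for score in ham_scores if score >= threshold)
    # False negative: spam that scores below the threshold.
    false_negatives = sum(1 for score in spam_scores if score < threshold)
    return (false_positives / len(ham_scores),
            false_negatives / len(spam_scores))
```

Running this over a few candidate thresholds shows how much FN rate you give up for each step down in FP rate, which is the trade-off you would be balancing.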

We get far fewer than 1/2000 FPs (usually 0), but 1/2000 is the maximum
ratio we'd allow before increasing the threshold.

- Ryan

