spamassassin-commits mailing list archives

From Apache Wiki <>
Subject [Spamassassin Wiki] Update of "HandClassifiedCorpora" by JustinMason
Date Sun, 26 Jun 2005 21:25:55 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:

The comment on the change is:
add note about unbalanced spam corpora

    * containing a representative mix of ham mail -- that includes commercial-sounding-but-not-spam
messages, legitimate business discussions (which may include talk of "sales", "marketing",
"offers", bankruptcies, mortgages, etc.), and verified opt-in mail newsletters. This is a ''very''
important point! Your ham corpus should contain as much ham as possible, as close as possible to
ALL valid email received by everybody, with only the exceptions noted here. ("As possible"
recognizes that for privacy and confidentiality reasons some ham cannot be stored
anywhere but its destination email folder.)
    * containing no old spam mail.  Older spam uses different tricks and terminology, which
will reduce SpamAssassin's accuracy when it's filtering "live", new mail.  Please try not
to scan spam older than 6 months. For this purpose it may be useful to categorize your spam
by month, and to regularly delete the files holding the older spam.
+   * containing a representative mix of spam mail.  If you bounce high-scoring spam, or have
collections of only user submissions of missed spams or false positive hams, this will unbalance
the corpus; it's better to scan collections of ''all'' spam received at a set of email accounts,
instead of a subset.
    * cleaned of viruses, bounce mails from broken virus and spam filters, and forwarded spam
messages.  These will skew the results.
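
The "categorize your spam by month" suggestion above could be sketched roughly as follows. This is only an illustration, not part of SpamAssassin or its mass-check tools: the function names, the 'YYYY-MM' grouping key, and the choice of 183 days as a stand-in for "6 months" are all assumptions. It also trusts the Date header, which the sender controls and which spam frequently forges, so a real setup may prefer the delivery time instead.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from email.message import Message
from email.utils import parsedate_to_datetime

# Assumption: "6 months" is approximated as 183 days.
MAX_AGE = timedelta(days=183)


def message_date(msg):
    """Return the Date header as an aware UTC datetime, or None if unusable."""
    raw = msg.get("Date")
    if raw is None:
        return None
    try:
        parsed = parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        # Unparseable dates raise TypeError or ValueError depending on
        # the Python version; treat both as "no usable date".
        return None
    if parsed.tzinfo is None:
        # Headers without a usable zone come back naive; assume UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def group_recent_by_month(messages, now):
    """Group messages no older than MAX_AGE under 'YYYY-MM' keys.

    Messages that are too old, or whose date cannot be read, are skipped.
    """
    groups = defaultdict(list)
    for msg in messages:
        when = message_date(msg)
        if when is None or now - when > MAX_AGE:
            continue
        groups[when.strftime("%Y-%m")].append(msg)
    return dict(groups)
```

The messages could come from iterating a `mailbox.mbox` object from the standard library, with `now` set to `datetime.now(timezone.utc)`; each resulting month group can then be written to its own file, so that dropping old spam is a matter of deleting the oldest files.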
