spamassassin-users mailing list archives

From Ryan Thompson <r...@sasknow.com>
Subject Re: [SURBL-Discuss] Setting SpamAssassin scores for SURBL lists
Date Sun, 05 Sep 2004 17:32:57 GMT
Jeff Chan wrote to SURBL Discuss and SpamAssassin Users:

> Basically the higher the FP rate, the less useful a list is.

... or, rather, the lower it ought to be scored.

> Does anyone have other corpus stats to share, in particular
> FP rates?


Sure. All of these messages were received in the past 10 days. A lot has
happened since June. :-)

WS: 44004/54185s, 61/19150h

  OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
    73335    54185    19150    0.739   0.00    0.00  (all messages)
  100.000  73.8870  26.1130    0.739   0.00    0.00  (all messages as %)
   60.087  81.2107   0.0836    0.999   0.00    0.00  WS_SURBL

HOWEVER... I decided to go through the ham hits (61 of them), and look
for false positive domains to submit. I found several, but, for the most
part, they've *already* been cleaned up and are no longer listed in WS.
(30 out of the 61 were in a massive mailing list thread for a single
domain that has since been whitelisted).

And, in that 19K ham corpus, I found the following FPs still listed
in WS:

buckeye-express.com      -- Used in a personal email address, looks legit;
                            7 examples
nm.ru                    -- Used in a personal email address, looks legit
advanstar.com            -- Legit uses; found in a well-known dental
                            newsletter; also personal email address of
                            one of the editors; 3 messages
00fun.com                -- Confirmed, more than one user on our system
                            sent or received eCards from them
northstarconferences.com -- Legit conference host site subscribed to
                            by two users; 9 messages in this corpus
mardox.com               -- Search engine; registered 1875 days ago, and
                            *looks* like the user did actually submit
                            their site to them.
postsnet.com             -- Registered exactly one year ago, 51 NANAS,
                            blank home page, ehh... but I have 4
                            different legit newsletters with links to
                            them.
webspawner.com           -- Created in 1996; free host/email
npdor.com                -- Surveys; been around since 1999. 103 NANAS,
                            but they've been advertised by some reputable
                            "word of the day" mailers (dictionary.com).
                            Maybe a good candidate for UC. :-) 2
                            examples
imninc.com               -- Domain is 507 days old; they do newsletters.
                            At least one of them is legit. :-)
worldhealth.net          -- It's 3468 days old today (1995). One of our
                            users attended a conference of theirs, and
                            signed up for a newsletter.
hoteldiscounts.com       -- 2459 days old (1997), found in actual room
                            booking confirmations for Comfort Inn.
(I'll re-post these in another thread, just so everybody sees them).

AND, I found 2 spams that were incorrectly hand-classified as ham.

So, if I take those out, the numbers look more like:

WS: 44006/54187s, 0/19148h

  OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
    73335    54187    19148    0.739   0.00    0.00  (all messages)
  100.000  73.8897  26.1103    0.739   0.00    0.00  (all messages as %)
   60.087  81.2111   0.0000    1.000   0.00    0.00  WS_SURBL
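
As a rough sanity check on the S/O column, here is a minimal sketch. It
assumes S/O is simply SPAM% / (SPAM% + HAM%), which is my reading of how the
hit-frequencies output defines it; the function name s_over_o is made up
for illustration. The inputs are the SPAM% and HAM% values copied from the
rows above.

```python
def s_over_o(spam_pct, ham_pct):
    """Assumed definition: share of a rule's hit rate that falls on spam."""
    total = spam_pct + ham_pct
    if total == 0:
        # The rule never hit anything, so the ratio is undefined.
        return None
    return spam_pct / total


# (label, SPAM%, HAM%) taken from the table rows in this message.
rows = [
    ("all messages",       73.8870, 26.1130),
    ("WS_SURBL, original", 81.2107, 0.0836),
    ("WS_SURBL, revised",  81.2111, 0.0000),
]

for label, spam_pct, ham_pct in rows:
    print(f"{label:20s} S/O = {s_over_o(spam_pct, ham_pct):.3f}")
```

Under that assumption the three rows come out at 0.739, 0.999 and 1.000,
matching the S/O column.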

Is that more like what you had in mind..? No, I'm not making that up.
:-)

Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
hand-check.
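
One way to pull out the candidates for that hand-check, as a sketch only: it
assumes your ham corpus is a directory with one message per file, and that
each message was already tagged by SpamAssassin with an X-Spam-Status header
listing the rules that fired. The function name find_rule_hits and the
directory path are hypothetical; adjust the rule string to whatever your
ruleset calls the WS test.

```python
import email
from pathlib import Path


def find_rule_hits(corpus_dir, rule="WS_SURBL", header="X-Spam-Status"):
    """Return paths of messages whose status header mentions `rule`."""
    hits = []
    for path in sorted(Path(corpus_dir).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as fp:
            msg = email.message_from_binary_file(fp)
        # The header is usually folded across lines; drop all whitespace
        # so the rule list reads as one comma-separated string.
        status = "".join(str(msg.get(header, "")).split())
        if rule in status:
            hits.append(path)
    return hits


if __name__ == "__main__":
    # "ham/" is a placeholder path, not a real location.
    for hit in find_rule_hits("ham/"):
        print(hit)
```

This does a substring match, so it also catches prefixed rule names that
contain WS_SURBL. It only lists files; deciding whether each hit is a real
FP is still a manual job.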

- Ryan

-- 
   Ryan Thompson <ryan@sasknow.com>

   SaskNow Technologies - http://www.sasknow.com
   901-1st Avenue North - Saskatoon, SK - S7K 1Y4

         Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
   Toll-Free: 877-727-5669     (877-SASKNOW)     North America
