spamassassin-dev mailing list archives

From Jeff Chan <je...@surbl.org>
Subject Re: Announcing SpamCopURI 0.08 support of SURBL for spam URI domain tests
Date Fri, 02 Apr 2004 10:11:39 GMT
On Thursday, April 1, 2004, 11:37:54 PM, Daniel Quinlan wrote:
> Jeff Chan <jeffc@surbl.org> writes:

>> Would someone with access to large spam and ham corpora please give
>> SpamCopURI a try against their recent data, as Daniel Quinlan did with
>> URIDNSBL + SURBL, and kindly let us know what kind of results they
>> obtain?  Currently four trailing days of SpamCop URI reports are
>> represented in SURBL.

> 2.6x modules, rules, and patches aren't very interesting right now.
> Give me a patch against URIDNSBL in 3.0 to add domain-to-domain testing
> and I'll gladly give it a whirl.

I would do that immediately if I knew how to write one.  I've
been rewriting my data stuff lately, while letting Eric update
SpamCopURI to now use SURBL.  (The somewhat frustrating thing is
that someone already familiar with SA 3.0 plugins could probably
make such a patch for URIDNSBL in a small fraction of the time it
would take me to come up to speed.  But I realize everyone else
is short of time also.)

> Four days still seems rather low.

What would be a better expiration time, and how do you suggest
removing domains from the blacklist once they are no longer
active in spams?

We can expire after any arbitrary number of days.  I'm leaning
towards seven days right now since it's a typical DNS cache
expiration interval.
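
To make the expiration question concrete, here is a rough sketch of
a rolling window, assuming the data is kept as a map from domain to
the date of its most recent report.  The function name, the data
layout, and the example domains are all hypothetical; this is not
how the SURBL data is actually built.

```python
from datetime import date, timedelta


def active_domains(last_reported, today, max_age_days=7):
    """Return domains whose most recent report is within the window.

    last_reported: dict of domain -> date of latest report
    (an assumed layout, chosen only for this illustration).
    """
    cutoff = today - timedelta(days=max_age_days)
    return {dom for dom, seen in last_reported.items() if seen >= cutoff}


# Hypothetical report data, not real SpamCop output.
reports = {
    "stillspamming.example": date(2004, 4, 1),
    "lastweek.example": date(2004, 3, 27),
    "longgone.example": date(2004, 3, 1),
}
today = date(2004, 4, 2)

four_day = active_domains(reports, today, max_age_days=4)
seven_day = active_domains(reports, today, max_age_days=7)
```

With this made-up data, the four-day window keeps only the domain
reported yesterday, while the seven-day window also keeps the one
last reported six days ago.  A domain that keeps getting reported
never falls out of either window.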

> Bear in mind that we're testing
> corpora that have spams somewhere between 0 and 3 months old (on
> average).  SpamCop is very hard to accurately gauge because stuff leaves
> so quickly.

True, but it also accurately reflects spams that people are
actually getting and reporting at any given moment.  To me
that feature has a significant value in timeliness.

If it's the case that domains expire out of the SpamCop
URI data sooner than the particular spam domains remain
a problem, then I could definitely see a need for a longer
expiration.  Being somewhat new to the game, I don't
have any data to support either argument.

My intuition is that if a domain continued to appear
in spam, people would continue to report it, and it
would therefore continue to show up in our SURBL data.
I'm interested in finding out what I may be overlooking
in this assumption.

Do you or anyone else here have some data that might shed
some light on this question?

> Expiring stuff quickly doesn't really reduce FPs unless
> you're testing old ham vs. new spam.  I care more about the S/O ratio
> (spam/overall where overall=ham+spam for a 50/50 mix of spam and ham).

My priorities are near zero FPs and near 100% accuracy in
the spams we do tag.  I don't guarantee that we will tag
all spams, but I'd like the ones we say are spam to actually
*be* spam.  Verity is important to me.
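
For reference, the S/O ratio Daniel describes could be computed
along these lines.  This is only a sketch: it normalizes the hit
counts to rates so that the result corresponds to a 50/50 mix of
spam and ham, and the function name and the counts are made up for
illustration rather than taken from any real test run.

```python
def s_o_ratio(spam_hits, spam_total, ham_hits, ham_total):
    """Spam/overall ratio for a rule, normalized to a 50/50 mix.

    Returns None when the rule hits nothing at all.
    """
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    overall = spam_rate + ham_rate
    if overall == 0:
        return None
    return spam_rate / overall


# Hypothetical counts: the rule hits 300 of 1000 spams and 1 of 1000 hams.
ratio = s_o_ratio(300, 1000, 1, 1000)
```

A rule with no ham hits at all gets an S/O of 1.0 no matter how few
spams it catches, which is the property I care most about.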

Other techniques may be able to catch spams which we miss, and we
may be able to improve our process to catch more spams our way.
I also think our spam% will be very high if the SpamCop reports
represent a good cross-section of actual spams at any given time.

Comments?  Surely I'm missing something...  ;)

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org-nospam
http://www.surbl.org/

