spamassassin-users mailing list archives

From Adam Katz <antis...@khopis.com>
Subject Re: using spamhaus droplist with sa ?
Date Tue, 22 Feb 2011 20:55:27 GMT
Andreas Schulze began:
>>>> http://www.spamhaus.org/faq/answers.lasso?section=DROP+FAQ 
>>>> mention as very last point to use the Spamhaus Drop list with
>>>> SA.

Yet Another Ninja continued:
>>> "DROP is a tiny subset of the SBL designed for use by firewalls
>>> and routing equipment."
>>> 
>>> Using it postqueue is pretty pointless as it's basically a "safe" 
>>> subset of SBL

RW added:
>> The suggestion is that it be scored higher for that reason.

>>>> is anybody doing this and can explain it in detail ?

Yet Another Ninja answered:
> if that is what you wish, you can set up a local rbldnsd zone and
> query that.

That's nontrivial since there is no DNSBL serving it.  Setting one up
requires regularly scraping that data.  The same would go if you were to
create a SpamAssassin rule from it.
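
To illustrate the scraping step, here is a minimal Python sketch that turns
DROP-style text ("CIDR ; SBLref" lines, with ";" starting a comment) into
entries for a local rbldnsd zone.  The function name, the 127.0.0.2 answer
and the sample SBL references are my own placeholders, and the
"CIDR :A:TXT" entry layout is my understanding of rbldnsd's ip4set format,
so check it against the rbldnsd manual before relying on it.

```python
import ipaddress


def drop_to_ip4set(drop_text, answer="127.0.0.2"):
    """Turn DROP-style text into ip4set-style zone lines (hypothetical helper)."""
    zone = []
    for raw in drop_text.splitlines():
        # Everything after ';' is a comment or the SBL reference.
        entry, _, ref = raw.partition(";")
        entry = entry.strip()
        if not entry:
            continue  # blank line or pure comment line
        # Validate the CIDR; raises ValueError on malformed input.
        net = ipaddress.ip_network(entry)
        # Assumed layout: "<cidr> :<A record>:<TXT record>"
        zone.append(f"{net} :{answer}:{ref.strip() or 'DROP'}")
    return zone


if __name__ == "__main__":
    sample = (
        "; sample data, not real DROP entries\n"
        "192.0.2.0/24 ; SBL00001\n"
        "198.51.100.0/24 ; SBL00002\n"
    )
    print("\n".join(drop_to_ip4set(sample)))
```

Fetching the list and reloading the zone on a schedule is the "regularly
scraping" part and is left out here.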

As a proof-of-concept, I have done the latter and added it as
KHOP_SPAMHAUS_DROP and KHOP_SPAMHAUS_DROP_LE (which checks only the
last-external relay) to my data-scraping sa-update channel
khop-sc-neighbors for testing.  It only runs in certain circumstances
and is scored very low as it's still in testing.  The resulting rule
contains a 5817-char regexp (from 3632 IP addresses in 402 CIDRs from a
6311-char source), which is more than twice the size of KHOP_SC_TOP200,
the channel's previously longest entry; twice the space for twice the
entries (18x the IPs).
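
For a rough idea of how CIDRs can become a regexp, here is a Python sketch.
It is not the generator behind the KHOP rules (that code is not shown in
this message), and the function names are hypothetical.  It relies on the
fact that an IPv4 CIDR block has fixed leading octets, at most one
partially covered octet, and fully open trailing octets, so each octet can
be translated on its own.

```python
import ipaddress
import re


def octet_pattern(lo, hi):
    """Regexp fragment matching one decimal octet in the range lo..hi."""
    if lo == 0 and hi == 255:
        return r"\d{1,3}"  # fully open octet; assumes a well-formed address
    if lo == hi:
        return str(lo)
    # Plain alternation: unoptimized, but easy to verify.
    return "(?:" + "|".join(str(n) for n in range(lo, hi + 1)) + ")"


def cidr_to_regex(cidr):
    """Regexp fragment matching every dotted quad inside one IPv4 CIDR."""
    net = ipaddress.IPv4Network(cidr)
    first = net.network_address.packed    # 4 bytes, lowest address
    last = net.broadcast_address.packed   # 4 bytes, highest address
    return r"\.".join(octet_pattern(a, b) for a, b in zip(first, last))


def cidrs_to_regex(cidrs):
    """One alternation covering all of the given CIDRs."""
    return "(?:" + "|".join(cidr_to_regex(c) for c in cidrs) + ")"


if __name__ == "__main__":
    rx = re.compile(cidrs_to_regex(["192.0.2.0/24", "10.64.0.0/10"]))
    for ip in ("192.0.2.77", "10.100.3.4", "10.128.0.1"):
        print(ip, bool(rx.fullmatch(ip)))
```

A real rule would want a more compact encoding of the partial octet than
a plain alternation, which is presumably where most of the space savings
come from.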

As with KHOP_SC_TOP200, I optimized for performance by scoring it zero
(skipping its evaluation) in the presence of DNSEval:

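# Score sets, in order: no net/no Bayes, net/no Bayes, no net/Bayes, net/Bayes.
# Zero in both network sets, so the rule is skipped there.
score    KHOP_SPAMHAUS_DROP     0.5 0 0.5 0
if (! plugin(Mail::SpamAssassin::Plugin::DNSEval) )
  # Parenthesized values are relative adjustments to the scores above.
  score  KHOP_SPAMHAUS_DROP     (0) (0.3) (0) (0.1)
endif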

I've had this sitting in SVN for a few days now.  It hits almost
nothing, but it is actually interesting; only 72% of the broader rule's
hits are mirrored in RCVD_IN_SBL.  The _LE rule has 93% overlap with SBL
(I was expecting 99+%).

The biggest surprise was that, for both rules, almost the entire score
map consists of corpus messages scoring at or under 8 points.

    Corpus      |      T_KHOP_SPAMHAUS_DROP       |    T_KHOP_SPAMHAUS_DROP_LE
DateRev   #spam | spam%   ham%    s/o  rank  SBL% | spam%  ham%    s/o  rank  SBL%
20110221   576k | .0323  .0030   .914   .54    72 | .0217     0  1.000   .52    93
20110220   599k | .0314  .0031   .911   .54    72 | .0209     0  1.000   .53    93
20110219   176k | .0996  .0041   .960   .55    72 | .0660     0  1.000   .53    93
20110218   595k | .0315  .0031   .910   .54    72 | (not added yet)
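
For anyone reading the table cold: the s/o column looks like the usual
spam/(spam+ham) ratio of the two hit percentages.  That reading is my
assumption rather than something stated above, but recomputing it from the
rounded spam% and ham% figures lands within about .001 of every s/o value
listed.  A small Python sketch (the function name is hypothetical):

```python
def s_over_o(spam_pct, ham_pct):
    """Share of a rule's hits that land on spam, from hit percentages."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # the rule hit nothing, so the ratio is undefined
    return spam_pct / total


if __name__ == "__main__":
    # (spam%, ham%) pairs copied from the T_KHOP_SPAMHAUS_DROP columns.
    for spam_pct, ham_pct in [(0.0323, 0.0030), (0.0314, 0.0031),
                              (0.0996, 0.0041), (0.0315, 0.0031)]:
        print(f"{s_over_o(spam_pct, ham_pct):.3f}")
```

With zero ham hits, as for the _LE rule, the ratio is exactly 1.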

PMCs:  I'd love to see the timing.log output so as to better measure
these rules' merit.  Actually, why isn't that data public on ruleqa?  If
it's too time-consuming, restrict it to the weekly network runs.


