spamassassin-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From bugzilla-dae...@issues.apache.org
Subject [Bug 6508] New: Speeding up lookups on {trusted,internal,msa}_networks
Date Thu, 04 Nov 2010 17:27:13 GMT
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6508

           Summary: Speeding up lookups on {trusted,internal,msa}_networks
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Libraries
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: Mark.Martinec@ijs.si


Prompted by the current brokenness in NetAddr::IP 4.034 and 4.035
(Bug 6507) and after Philip Prindeville pointed out that a
Net::Patricia module might be a useful replacement or addition,
I have instrumented the module Mail::SpamAssassin::NetSet with
timing reports, and later added the use of Net::Patricia to it
- with very encouraging (and worrying) results.

Btw, the Net::Patricia module implements radix trie lookups, which are
well suited for fixed size keys (such as 128 bits of a network address)
and does a lookup in constant time O(k), proportional only to a size
of a network address (fixed 128 bits), and independent of a size of
the trie. As such it is commonly used in IP routing, efficiently
finding a closest match from a set of available routes.
See:
  http://en.wikipedia.org/wiki/Patricia_trie


Here are some benchmarking results measured on real life mail,
on an AMD quad core Opteron 2376:


Elapsed time PER MESSAGE spent in NetSet lookups (sub contains_ip)
for a set of n networks (using about 15 DNS white-/black lists):

 nets   current
 ----   --------
   27    22 ms
 1000   500 ms
 7000  3700 ms


Elapsed time for looking up ONE IP ADDRESS in a set of n networks:

 nets   current   patricia trie
 ----   --------  -------------
   27     1.5 ms  0.07-0.18 ms
 1000    34 ms    0.10-0.20 ms
 7000   250 ms    0.10-0.20 ms

See also:
  Bug 5931 - add_cidr chokes on large amount of trusted_networks

Conclusion 1: for anything above 4 or so networks it pays off
to use Net::Patricia lookups.

Conclusion 2: the Plugin/DNSEval.pm performs one lookup on the
trusted_networks list for each DNSBL response, all of these are
lookups on THE SAME IP address. It clearly calls for some caching
of NetSet lookup results within processing of each message.


Now, there is one caveat with using a Patricia trie for lookups:
it finds the tightest match on listed CIDR networks - unlike our
present sequential search, which returns the first match in the
order of declared networks. For all practical purposes on valid
data this makes no difference. The difference pops up on overlapping
conflicting networks. Some of these are reported by lint, but not
all. For this reason a couple of t/trust_path.t tests currently
fail when Net::Patricia is installed and NetSet decides to use it.
If Net::Patricia is not installed, the NetSet just uses the old
sequential search on a list of NetAddr::IP objects.

Guessing that Hudson does not have the Net::Patricia installed,
I'd just roll in the change to trunk for ease of assessment,
and can back off patricia or change the trust_path.t tests later
on when we decide what would be the best. Would that be alright?

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.

Mime
View raw message