spamassassin-dev mailing list archives

From bugzilla-dae...@bugzilla.spamassassin.org
Subject [Bug 6793] PATCH reduce sa-awl memory usage
Date Tue, 24 Apr 2012 07:47:11 GMT
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6793

Vitaly V. Bursov <vitalyb@telenet.dn.ua> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vitalyb@telenet.dn.ua

--- Comment #4 from Vitaly V. Bursov <vitalyb@telenet.dn.ua> 2012-04-24 07:47:11 UTC
---
The new patch has the same root issue as the original sa-awl: it potentially
stores a lot of keys in memory. I had to reduce a DB from 8M keys to 800K, and
the proposed algorithm would keep around 3M keys in memory. If 800K keys eat 1G
of RAM on x86-64... well, that is not much better than the original.
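
To make the scale concrete, here is a rough back-of-the-envelope sketch. It
assumes memory grows linearly with the number of keys and reads "1G" as 1 GiB;
the 800K-keys-per-1G ratio is the conditional figure from the paragraph above,
not a measurement, and estimate_ram_gib is a hypothetical helper name.

```python
def estimate_ram_gib(keys, ref_keys=800_000, ref_gib=1.0):
    """Scale a reference (keys, GiB) data point linearly to another key count.

    Linear scaling is an assumption; real per-key overhead depends on key
    length and on how the interpreter stores the list.
    """
    return keys * ref_gib / ref_keys


# Roughly 1342 bytes per key under the assumed ratio.
bytes_per_key = 1024 ** 3 / 800_000

# Estimated footprint for the ~3M keys the proposed algorithm would hold.
print(round(bytes_per_key), estimate_ram_gib(3_000_000))
```

Under these assumptions, 3M in-memory keys would need several GiB, which is
the concern raised above.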

It looks like there are a few options, then.

1. Create a new DB file and replace the old one with it. This is probably hard
to do correctly while SA is running.

2. Leave missed or duplicate keys as they are: duplicates are probably
harmless, and missed keys will be handled on subsequent runs. Ugly.

3. Modify the algorithm. There are a few options here as well:

a) check the size of @delete_keys, stop iterating if it is over the limit,
delete the keys, and start all over again (very slow);

b) if we get a 'totscore' key, also fetch and check the matching 'count' key
(strip the /totscore$/ part from the key name); if we get a 'count' key,
proceed as usual. If the keys should be deleted, delete the current one and
remember the other one in @delete_keys. The trick is that keys should not be
deleted all at once afterwards; instead, on every iteration the key name should
be checked against @delete_keys, and if it is there, the key must be deleted
and removed from @delete_keys. The size of @delete_keys should also be checked
to keep it from consuming too much memory.

c) store keys that should be deleted in another on-disk DB.
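
A minimal sketch of option 3b, in Python rather than sa-awl's Perl, with a
plain dict standing in for the on-disk DB. Several things here are assumptions
for illustration only: the key layout ("<addr>" holds the count,
"<addr>|totscore" holds the score sum), the deletion rule (count below a
hypothetical min_count), the names clean_pass and clean, and the behaviour when
the pending set hits its limit (stop the pass, flush, start over). The
list(db) snapshot only exists because a Python dict cannot be modified while
it is being iterated; a real DB cursor would stream keys instead.

```python
TOTSCORE_SUFFIX = "|totscore"  # assumed key layout, see lead-in


def count_key_of(key):
    """Return the 'count' key that belongs to either key of a pair."""
    if key.endswith(TOTSCORE_SUFFIX):
        return key[:-len(TOTSCORE_SUFFIX)]
    return key


def partner_of(key):
    """Return the other key of the count/totscore pair."""
    if key.endswith(TOTSCORE_SUFFIX):
        return key[:-len(TOTSCORE_SUFFIX)]
    return key + TOTSCORE_SUFFIX


def clean_pass(db, min_count, max_pending):
    """One pass over the DB. Only the current key is ever deleted mid-pass.

    Returns True if the pass reached the end, False if it stopped early
    because the pending set reached max_pending.
    """
    pending = set()  # plays the role of @delete_keys
    finished = True
    for key in list(db):  # snapshot stands in for a streaming cursor
        if key in pending:
            # The partner was judged earlier; delete now that we are on it.
            del db[key]
            pending.remove(key)
            continue
        count = db.get(count_key_of(key))
        if count is None or count >= min_count:
            continue  # keep the entry (orphan totscore keys are left alone)
        del db[key]
        partner = partner_of(key)
        if partner in db:
            pending.add(partner)
        if len(pending) >= max_pending:
            finished = False
            break
    # The iteration is over, so deleting the leftovers is safe.
    for key in pending:
        db.pop(key, None)
    return finished


def clean(db, min_count, max_pending=1000):
    """Repeat passes until one completes without hitting the limit."""
    while not clean_pass(db, min_count, max_pending):
        pass
```

With keys visited in random order, the pending set holds one entry for each
pair whose second half has not been reached yet, so how often the limit is hit
depends on the DB's iteration order.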

A few more thoughts.

It's hard to predict how efficient method 3b is going to be compared to 3a, as
keys are randomly sorted (I think). It is probably best to have two
implementations and make it an option: a 'fast' one that is not entirely
correct, for huge DBs (like 2), and a default one that is correct but not so
fast (like 3b), for small to medium-sized DBs.
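
For comparison, option 3a could be sketched as below. This is Python with a
plain dict in place of the DB, and clean_restarting, should_delete and
max_keys are hypothetical names; the per-key should_delete predicate is a
simplification that ignores the count/totscore pairing. The returned pass
count shows where the cost comes from: every time the limit is hit, the scan
starts again from the first key.

```python
def clean_restarting(db, should_delete, max_keys):
    """Collect at most max_keys keys per pass, delete them, then rescan.

    Nothing is deleted while iterating; deletes happen between passes.
    Returns the number of passes over the DB that were needed.
    """
    passes = 0
    while True:
        passes += 1
        delete_keys = []
        hit_limit = False
        for key, value in db.items():  # read-only scan
            if should_delete(key, value):
                delete_keys.append(key)
                if len(delete_keys) >= max_keys:
                    hit_limit = True
                    break
        for key in delete_keys:
            del db[key]
        if not hit_limit:
            return passes


# Usage: ten entries, five of them deletable, at most two remembered per pass.
db = {f"k{i}": i for i in range(10)}
passes = clean_restarting(db, lambda k, v: v % 2 == 0, max_keys=2)
print(passes, sorted(db))
```

The number of passes grows roughly as (keys to delete) / max_keys, and each
pass rescans the surviving keys, which is why this variant is expected to be
slow on large DBs with a small limit.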

Hope this helps,
Thanks.

-- 
Configure bugmail: https://issues.apache.org/SpamAssassin/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.
