spamassassin-users mailing list archives

From Matus UHLAR - fantomas <uh...@fantomas.sk>
Subject Re: SA memory (Re: ".*" in body rules)
Date Wed, 11 Dec 2019 12:12:46 GMT
>On Wed, Dec 11, 2019 at 10:53:04AM +0100, Matus UHLAR - fantomas wrote:
>> On 11.12.19 11:43, Henrik K wrote:
>> >Wow 6 million tokens.. :-)
>> >
>> >I assume the big uuencoded blob content-type is text/* since it's tokenized?

>> yes, I mentioned that in previous mails. ~15M file, uuencoded in ~20M mail.
>>
>> grep -c '^M' spamassassin-memory-error-<...>
>> 329312
>>
>> One of former mails mentioned that 20M mail should use ~700M of RAM. 6M
>> tokens eating about 4G of RAM means ~750B per token, is that fine?
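
The per-token estimate above is plain division. A minimal Python sketch, assuming "4G" and "6M" are round figures (the function name is only illustrative); the result shifts a little depending on whether "G" is read as GiB or GB:

```python
def bytes_per_token(total_bytes: float, tokens: int) -> float:
    """Average memory attributed to one token (a rough estimate, not a measurement)."""
    return total_bytes / tokens


TOKENS = 6_000_000                      # "6M tokens" as quoted above

gib_estimate = bytes_per_token(4 * 1024**3, TOKENS)   # reading "4G" as 4 GiB
gb_estimate = bytes_per_token(4 * 1000**3, TOKENS)    # reading "4G" as 4 GB

print(f"4 GiB / 6M tokens = {gib_estimate:.0f} bytes per token")
print(f"4 GB  / 6M tokens = {gb_estimate:.0f} bytes per token")
```

Either way the figure is several hundred bytes per token, the same order of magnitude as the ~750B quoted.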

On 11.12.19 12:07, Henrik K wrote:
>I'm pretty sure the Bayes code does many dumb things with the tokens
>that result in much memory usage for abnormal cases like this.

but apparently nobody notices...

>> >This will be mitigated in 3.4.3, since it will only use max 50k of the body
>> >text (body_part_scan_size).

>> will it prefer text parts and try to avoid uuencoded or base64 parts?
>> (or maybe decode them?)

>There is no change in how parts are processed.  As before, "body" is the
>concatenated result of all textual parts.  But in 3.4.3 at least each part is
>truncated to 50k.  If there are several parts then it's 50+50k etc..
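
The per-part truncation described above can be pictured with a short Python sketch. This is not SpamAssassin's code: the function name and the exact limit of 50,000 bytes are assumptions chosen to match the "50k" mentioned, and parts are simply joined without separators.

```python
PART_LIMIT = 50_000  # assumed value for the "50k" per-part limit


def build_body(text_parts, limit=PART_LIMIT):
    """Truncate every textual part to `limit` bytes, then concatenate them.

    Each part is cut independently, so N parts can still contribute
    up to N * limit bytes in total.
    """
    return b"".join(part[:limit] for part in text_parts)


# A ~20M uuencoded-looking part plus a short text part (made-up data).
parts = [b"M" * 20_000_000, b"hello"]
body = build_body(parts)
print(len(body))
```

With a single huge text/* part, the amount of text handed to tokenization drops from about 20M to the per-part limit.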

I understand that such a change probably should not be made in a minor version.

Well, I tried it on a currently unused machine with 16G of RAM and moved the
bayes DB there (scanning on an account without bayes was fast even on the
original machine, with lower memory usage, maybe the ~700M mentioned earlier).

Scanning took 17 minutes, peaking at 4.8G of memory.

When I tried the same with redis (with the bayes DB copied there), scanning
peaked at 3.8G but took 29 minutes (???), even when I repeated the test.

I understand I am probably pushing this too far, but you never know in advance.

I also understand that redis is great for parallel scanning.



Below are logs from the scan with the file-based bayes DB, including the
places where the biggest time differences are:


Dec 11 10:45:42.261 [12972] dbg: logger: adding facilities: all
...
Dec 11 10:45:43.969 [12972] dbg: message: ---- MIME PARSER END ----
Dec 11 10:45:44.038 [12972] dbg: message: no encoding detected
Dec 11 10:46:10.379 [12972] dbg: plugin: Mail::SpamAssassin::Plugin::URIDNSBL=HASH(0x5617d8cf6c48) implements 'parsed_metadata', priority 0
Dec 11 10:46:23.131 [12972] dbg: uridnsbl: more than 20 URIs, picking a subset
...
Dec 11 10:46:23.272 [12972] dbg: async: starting: DNSBL-A, dns:A:70.175.80.195.iadb.isipp.com (timeout 15.0s, min 3.0s)
Dec 11 10:48:22.828 [12972] dbg: check: check_main, time limit in 1639.598 s
...
Dec 11 10:48:23.005 [12972] dbg: bayes: corpus size: nspam = 89264, nham = 17109
Dec 11 10:49:30.445 [12972] dbg: bayes: tokenized body: 6158242 tokens
Dec 11 10:49:35.335 [12972] dbg: bayes: tokenized uri: 10881 tokens
Dec 11 10:49:35.351 [12972] dbg: bayes: tokenized invisible: 0 tokens
Dec 11 10:49:35.354 [12972] dbg: bayes: tokenized header: 208 tokens
Dec 11 10:50:54.200 [12972] dbg: bayes: score = 0.5
...
Dec 11 10:50:54.202 [12972] dbg: check: tagrun - tag TOKENSUMMARY is now ready, value: CODE(0x5617de4969e8)
Dec 11 10:50:58.537 [12972] dbg: async: select found no responses ready (t.o.=0.0)
Dec 11 10:50:58.537 [12972] dbg: async: queries completed: 0, started: 0
Dec 11 10:50:58.537 [12972] dbg: async: queries active: DNSBL-A=4 DNSBL-TXT=2 URI-A=9 URI-DNSBL=20 URI-NS=10, all expired at Wed Dec 11 10:50:58 2019
Dec 11 10:51:01.653 [12972] dbg: rules: running rawbody tests; score so far=-0.699
...
Dec 11 10:51:02.711 [12972] dbg: rules: compiled body tests
Dec 11 10:51:08.066 [12972] dbg: rules: ran body rule __hk_bigmoney ======> got hit: "$NK7M"
Dec 11 10:52:00.372 [12972] dbg: rules: ran body rule __DRUGS_MUSCLE1 ======> got hit: "@S"'<0 MA[+*"
Dec 11 10:52:01.853 [12972] dbg: rules: ran body rule __LOTSA_MONEY_03 ======> got hit: "$3M"
Dec 11 10:52:01.886 [12972] dbg: rules: ran body rule __DOS_BODY_WED ======> got hit: "WED"
Dec 11 10:52:05.859 [12972] dbg: rules: ran body rule __LOTSA_MONEY_01 ======> got hit: "$94O0541"
Dec 11 10:52:31.895 [12972] dbg: rules: ran body rule __HAS_ANY_EMAIL ======> got hit: "a@nspnz.s"
Dec 11 10:52:53.298 [12972] dbg: rules: ran body rule __DOS_BODY_SUN ======> got hit: "SUN"
Dec 11 10:52:53.298 [12972] dbg: rules: ran body rule __DOS_BODY_TUE ======> got hit: "Tuesday"
Dec 11 10:53:01.629 [12972] dbg: rules: ran body rule __FIFTY_FIFTY ======> got hit: "50%"
Dec 11 10:53:04.870 [12972] dbg: rules: ran body rule __DOS_BODY_SAT ======> got hit: "SAT"
Dec 11 10:53:06.939 [12972] dbg: rules: ran body rule __DOS_BODY_FRI ======> got hit: "FRI"
Dec 11 10:53:06.951 [12972] dbg: rules: ran body rule __freemail_safe_fwd ======> got hit: "---Original Message"
Dec 11 10:56:14.590 [12972] dbg: rules: ran body rule __FRAUD_DBI ======> got hit: "$,,
M"
Dec 11 10:56:58.462 [12972] dbg: rules: ran body rule __FB_COST ======> got hit: "COST"
Dec 11 10:57:02.611 [12972] dbg: rules: ran body rule FUZZY_PRICES ======> got hit: "PR!@*3Z"
Dec 11 10:57:07.993 [12972] dbg: rules: ran body rule WEIRD_QUOTING ======> got hit: """,`_'2""""
Dec 11 10:57:11.069 [12972] dbg: rules: ran body rule FUZZY_CPILL ======> got hit: "KYO11Z"
Dec 11 10:57:21.916 [12972] dbg: rules: ran body rule __LOTSA_MONEY_02 ======> got hit: "2,3O964$"
Dec 11 10:57:47.954 [12972] dbg: rules: ran body rule __DOS_BODY_THU ======> got hit: "THU"
Dec 11 10:58:03.551 [12972] dbg: rules: ran body rule __LOTSA_MONEY_04 ======> got hit: "1MN98USD"
Dec 11 10:58:10.930 [12972] dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "R"
Dec 11 10:58:29.975 [12972] dbg: rules: ran body rule FUZZY_CREDIT ======> got hit: "CREDYT"
Dec 11 10:58:42.635 [12972] dbg: rules: ran body rule __FUZZY_DR_OZ ======> got hit: "DGC0S
"
Dec 11 10:58:54.772 [12972] dbg: rules: ran body rule __DOS_BODY_TICKER ======> got hit: "MVYR.PK"
Dec 11 10:59:20.483 [12972] dbg: rules: ran body rule __FB_NUM_PERCNT ======> got hit: "0%"
Dec 11 10:59:20.490 [12972] dbg: rules: ran body rule __DOS_BODY_MON ======> got hit: "MON"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "R"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "M"
Dec 11 10:59:25.221 [12972] dbg: rules: ran body rule __BODY_TEXT_LINE ======> got hit: "<CF>"
Dec 11 11:00:24.968 [12972] dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "X;#0NA%X"
Dec 11 11:02:29.635 [12972] dbg: dns: bgread: received 113 bytes from 10.51.1.14
...
Dec 11 11:02:31.471 [12972] dbg: rules: compiled rawbody tests
Dec 11 11:02:35.862 [12972] dbg: rules: ran rawbody rule __HTML_SINGLET ======> got hit: ">W<"
...
Dec 11 11:02:36.349 [12972] dbg: rules: [...] M5ULS"=>P"
Dec 11 11:02:37.267 [12972] dbg: async: select found no responses ready (t.o.=0.0)
...
Dec 11 11:02:37.281 [12972] dbg: check: ascii_text_illegal: matches >> Odoslan<e9> z iPhonu
Dec 11 11:02:38.039 [12972] dbg: async: select found no responses ready (t.o.=0.0)
...
Dec 11 11:02:38.064 [12972] dbg: dns: entering helper-app run mode
Dec 11 11:02:43.064 [12972] dbg: dns: leaving helper-app run mode
...
Dec 11 11:02:43.735 [12972] dbg: netset: cache trusted_networks hits/attempts: 8/10, 80.0 %
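
To locate the slow spots in a debug log like the one above without reading it by eye, the timestamps can be compared mechanically. A minimal Python sketch, assuming the line layout shown in this log ("Mon DD HH:MM:SS.mmm [pid] ...") and English month abbreviations; the year is not in the log, so 2019 is passed in as an assumption, and the function name is only illustrative:

```python
import re
from datetime import datetime

# Matches the leading timestamp of lines such as
# "Dec 11 10:49:30.445 [12972] dbg: bayes: tokenized body: 6158242 tokens"
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d\.\d+) \[\d+\] ")


def largest_gaps(lines, year=2019, top=5):
    """Return the `top` largest (seconds, line) gaps between consecutive
    timestamped lines; each gap is reported with the line that ended it."""
    stamped = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip "..." markers and wrapped continuation lines
        ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S.%f")
        stamped.append((ts, line.rstrip()))

    gaps = [
        ((later - earlier).total_seconds(), text)
        for (earlier, _), (later, text) in zip(stamped, stamped[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[0], reverse=True)[:top]


# Three lines copied from the log above.
sample = [
    "Dec 11 10:48:23.005 [12972] dbg: bayes: corpus size: nspam = 89264, nham = 17109",
    "Dec 11 10:49:30.445 [12972] dbg: bayes: tokenized body: 6158242 tokens",
    "Dec 11 10:49:35.335 [12972] dbg: bayes: tokenized uri: 10881 tokens",
]
for seconds, text in largest_gaps(sample):
    print(f"{seconds:8.3f}s before: {text}")
```

Because the log here is abridged with "..." lines, a gap only says that the time was spent somewhere between the two lines that are shown, not necessarily in the step named on the later line.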
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
I feel like I'm diagonally parked in a parallel universe.
