spamassassin-users mailing list archives

From Amir Caspi <ceph...@3phase.com>
Subject Re: Bayes underperforming, HTML entities?
Date Fri, 30 Nov 2018 05:33:12 GMT
On Nov 29, 2018, at 10:11 PM, Bill Cole <sausers-20150205@billmail.scconsult.com> wrote:
> 
> I have no issue with adding a new rule type to act on the output of a partial well-defined
> HTML parsing, something in between 'rawbody' and 'body' types, but overloading normalize_charset
> with that and so affecting every existing rule of all body-oriented rule types would be a
> bad design.

The problem as I see it is that spammers are using HTML entity encoding as, effectively, another charset,
and as a way of obfuscating content, much as they did (and still do) with Unicode lookalikes... but unless
those HTML entities are decoded, there is no way to catch this obfuscation.

In other words, the encoded entities DISPLAY as something different from the content over
which rules run... and because the encoding is cumbersome and not human-readable, it also makes
writing rules to catch these words MUCH harder. Worse yet, they evade Bayes almost completely, because
the encoded words don’t tokenize well.
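To make the mismatch concrete (a purely illustrative, hypothetical snippet using Python's standard
html module, not anything in SpamAssassin itself):

    import html

    # What body rules and the Bayes tokenizer actually see:
    raw = "V&#105;&#97;gr&#97; for fr&#101;&#101;"

    # What the recipient's mail client renders:
    print(html.unescape(raw))   # -> "Viagra for free"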

Maybe normalize_charset isn’t the right place to do it, but it seems like there should be
some way of converting HTML-encoded entities into their single-character ASCII or Unicode
equivalents before body rules run, and especially before Bayes tokenization, so that we tokenize
and run our rules on the -displayed- text rather than the encoded text...
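Continuing the illustration above (again only a Python sketch of the concept, not a patch;
SpamAssassin itself is Perl, and tokenize() here is just a stand-in for the real Bayes tokenizer):

    import html
    import re

    def tokenize(text):
        # Stand-in for the real Bayes tokenizer: split on non-word characters.
        return [t for t in re.split(r"\W+", text) if t]

    raw_body = "V&#105;&#97;gr&#97; for fr&#101;&#101;"

    # Without decoding, the tokens are fragments of the encoding:
    print(tokenize(raw_body))                 # ['V', '105', '97', 'gr', '97', ...]

    # Decoding the entities first yields tokens that match the displayed words:
    print(tokenize(html.unescape(raw_body)))  # ['Viagra', 'for', 'free']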

How best to achieve this?

--- Amir
