spamassassin-users mailing list archives

From RW <>
Subject Re: Bayes underperforming, HTML entities?
Date Fri, 30 Nov 2018 13:09:20 GMT
On Thu, 29 Nov 2018 22:33:12 -0700
Amir Caspi wrote:

> On Nov 29, 2018, at 10:11 PM, Bill Cole
> <> wrote:
> > 
> > I have no issue with adding a new rule type to act on the output of
> > a partial well-defined HTML parsing, something in between 'rawbody'
> > and 'body' types, but overloading normalize_charset with that and
> > so affecting every existing rule of all body-oriented rule types
> > would be a bad design.  
> The problem as I see it is that spammers are using HTML encoding as
> effectively another charset, and as a way of obfuscating like they
> did/do with Unicode lookalikes... but unless those HTML characters
> are translated there is no way to catch this obfuscation.

normalize_charset is about converting text from whatever character set
it's in to UTF-8, and nothing else. SpamAssassin should already decode
HTML to text for body rules. Rules that match the HTML entities
themselves use rawbody precisely to avoid that conversion to plain text.
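As an illustration (a minimal sketch in Python, not SpamAssassin's actual parser, and the sample text is made up), the entity decoding that happens before body rules run is essentially what html.unescape does, while rawbody-style rules see the entities as-is:

```python
import html
import re

# Spammers encode payload letters as HTML entities, e.g. "C" as "&#67;".
raw = "Buy &#67;&#105;&#97;&#108;&#105;&#115; now"

# A rawbody-style rule sees the entities themselves:
assert re.search(r"&#\d+;", raw)

# A body-style rule sees the decoded text, so a plain-word pattern matches:
decoded = html.unescape(raw)
assert re.search(r"\bCialis\b", decoded)
print(decoded)  # -> Buy Cialis now
```

This is why overloading normalize_charset is unnecessary for catching entity-encoded words: body rules already operate on the decoded text.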

The most substantial problem here is that these invisible characters
make it very hard to write ordinary body rules.
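To see why (a hedged sketch with made-up sample text, not a real spam corpus): an entity like &#8203; decodes to a zero-width space, which is invisible when rendered but silently breaks an ordinary body regex:

```python
import html
import re

# A zero-width space encoded as an HTML entity, inserted mid-word.
raw = "fr&#8203;ee pills"
decoded = html.unescape(raw)  # "fr\u200bee pills"

# The text renders to the reader as "free pills", but a naive body
# rule fails because the invisible character splits the word:
assert re.search(r"\bfree\b", decoded) is None

# Stripping common invisible characters first restores the match:
cleaned = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", decoded)
assert re.search(r"\bfree\b", cleaned)
```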
