spamassassin-users mailing list archives

From Reindl Harald <h.rei...@thelounge.net>
Subject Re: charset=utf-16 tricks out SA
Date Sat, 10 Oct 2015 08:36:01 GMT
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7252 filed with the
sample and a link to this list thread - marked major because the sample
is just an English mail tricking out SA, and if spammers find that
information I expect a flood sooner or later - not disclosing the
problem, and so not getting it fixed, won't make things better in the
long run

On 10.10.2015 at 03:03, RW wrote:
> On Fri, 09 Oct 2015 14:22:18 +0200
> Mark Martinec wrote:
>
>> The problem with this message is that it declares the encoding
>> as UTF-16, i.e. without explicitly stating the endianness as
>> UTF-16BE or UTF-16LE would, and there is no BOM (byte order mark)
>> at the beginning of each textual part, so the endianness cannot
>> be determined. RFC 2781 says that big-endian encoding should be
>> assumed in the absence of a BOM.
>> See https://en.wikipedia.org/wiki/UTF-16
>>
>> In the provided message the actual endianness is LE and the BOM
>> is missing, so decoding as UTF-16BE fails and the rule does not
>> hit. Garbage in, garbage out.
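
For illustration, a minimal Python sketch (mine, not SpamAssassin code;
the wording is borrowed from the sample) of the ambiguity described
above: with a BOM the byte order is unambiguous, without one a decoder
following RFC 2781 has to assume big-endian and turns little-endian
ASCII into completely different code points:

  text = "Dear potencial partner,"                # wording from the sample
  le = text.encode("utf-16le")                    # how the body is actually encoded
  with_bom = "\ufeff".encode("utf-16le") + le     # what a well-formed part would carry

  # RFC 2781: no BOM -> assume big-endian.  On LE data this "succeeds"
  # but yields entirely different characters, none of them ASCII:
  print([hex(ord(c)) for c in le.decode("utf-16be")[:4]])
  #   ['0x4400', '0x6500', '0x6100', '0x7200']  instead of D, e, a, r
  print(with_bom.decode("utf-16"))                # BOM present -> 'Dear potencial partner,'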
>
>
> I'm not seeing any body tokens, even after training.
>
> I was expecting that the text would be tokenized as individual UTF-8
> sequences. ASCII characters encoded as UTF-16 and decoded with the
> wrong endianness are still valid UTF-16. Normalizing them into UTF-8
> should produce nothing but multi-byte UTF-8 sequences, with no ASCII
> whitespace or punctuation left (the only whitespace is U+2000, which
> the ASCII space turns into).
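
A rough sketch (again mine, same assumed sample text) of what that
normalization would produce - every byte belongs to a multi-byte UTF-8
sequence and the only whitespace-like character left is U+2000:

  wrong = "Dear potencial partner,".encode("utf-16le").decode("utf-16be")

  print(hex(ord(wrong[4])))              # 0x2000 - the ASCII space became U+2000
  utf8 = wrong.encode("utf-8")
  print(all(b >= 0x80 for b in utf8))    # True - no ASCII bytes at all, hence
                                         # no ordinary whitespace or punctuation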
>
> If I add John Hardin's diagnostic rule
>
> body     __ALL_BODY     /.*/
> tflags   __ALL_BODY     multiple
>
> I get:
>
> ran body rule __ALL_BODY ======> got hit: " _ _D_e_a_r_
> _p_o_t_e_n_c_i_a_l_ _p_a_r_t_n_e_r_,_ _ _W_e_ _a_r_e_
> _p_r_o_f_e_s_s_i_o_n_a_l_ _i_n_ _e_n_g_i_n_e_e_r_i_n_g_,_
> _...
>
> It looks like it's still UTF-16, and Bayes is seeing individual
> letters (which are too short to be tokens) separated by nulls.
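
That is consistent with the body still being raw UTF-16LE bytes; a
quick sketch (assuming the sample wording) of why nothing token-sized
is left in it:

  import re

  raw = "Dear potencial partner,".encode("utf-16le")
  print(raw[:16])    # b'D\x00e\x00a\x00r\x00 \x00p\x00o\x00t\x00' - the NULs
                     # are presumably what show up as "_" in the hit above
  print(re.split(rb"[^A-Za-z]+", raw)[:6])   # [b'D', b'e', b'a', b'r', b'p', b'o']
  # Nothing survives longer than a single letter, far too short for Bayes tokens.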
>
> If I change the MIME charset to utf-16le it works correctly, except
> that the subject isn't converted - including the copy in the body.
> If I set the charset to utf-16be I get what appears to be the
> multi-byte UTF-8 I was expecting.
>
> So SA isn't falling back to big-endian; it won't normalize at all
> without an explicit endianness.
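
A minimal sketch (my reading of RFC 2781, not SA's actual code) of the
fallback behaviour that would let such a part be normalized anyway:

  def decode_utf16(data: bytes) -> str:
      """Decode 'charset=utf-16' content: honour a BOM, else assume big-endian."""
      if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
          return data.decode("utf-16")      # BOM present - codec picks the order
      return data.decode("utf-16be")        # no BOM - RFC 2781 default

  print(decode_utf16("\ufeffhello".encode("utf-16le")))   # 'hello'
  print(decode_utf16("hello".encode("utf-16be")))         # 'hello'

On the BOM-less little-endian sample this would still yield the garbage
shown earlier, which is exactly what RFC 2781 prescribes.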
>
>
> BTW, with normalize_charset 0 it looks like a spammer can effectively
> turn off body tokenization by using UTF-16 (with correct endianness).

