spamassassin-users mailing list archives

From John Hardin <>
Subject Re: No longer just embedded =9D characters in blackmail emails.
Date Wed, 05 Dec 2018 22:27:28 GMT

On Wed, 5 Dec 2018, Grant Taylor wrote:

> On 12/05/2018 02:45 PM, John Hardin wrote:
>> I've added a "too many [ascii][unicode][ascii]" rule based on that but I 
>> suspect it will be pretty FP-prone and will be pretty large if we want to 
>> avoid whack-a-mole syndrome. For this, normalize + bayes is probably the 
>> best bet.
> Is it possible to detect when a Unicode code point is being used in place of 
> an ASCII / ANSI character specifically to avoid pattern detection?  I.e. 
> multiple Unicode code points that represent or are otherwise a stand in for 
> an ASCII / ANSI "a"?

Take a look at replace_rules in the repo (both standard and sandboxes).

> Or is keeping up with this list tantamount to whack-a-mole?

The Unicode replacements are fairly stable; it's looking for specific 
obfuscated words (like "bitcoin") that's whack-a-mole.
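
As a rough illustration of why the character-level part is the stable half, here is a minimal Python sketch that folds lookalike code points back to ASCII before searching for a word. It is not how replace_rules is implemented and is not taken from the SpamAssassin rules; the LOOKALIKES table, fold_to_ascii() and mentions() are hypothetical names made up for this example, and the table is deliberately tiny.

```python
import unicodedata

# Hypothetical, deliberately tiny lookalike table (an assumption for
# illustration only); a real table would cover far more code points.
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic small letter a
    "\u0435": "e",  # Cyrillic small letter ie
    "\u043e": "o",  # Cyrillic small letter o
    "\u0441": "c",  # Cyrillic small letter es
    "\u0456": "i",  # Cyrillic small letter Byelorussian-Ukrainian i
    "\u0131": "i",  # Latin small letter dotless i
}


def fold_to_ascii(text: str) -> str:
    """Map accented and lookalike characters to plain ASCII where known."""
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility forms (e.g. fullwidth letters) to their base form.
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the combining accent, keep the base letter
        out.append(LOOKALIKES.get(ch, ch))
    return "".join(out)


def mentions(word: str, text: str) -> bool:
    """True if the lowercase ASCII word appears once the text is folded."""
    return word in fold_to_ascii(text.lower())


if __name__ == "__main__":
    print(fold_to_ascii("b\u0456tc\u043e\u0131n"))
    print(mentions("bitcoin", "Send B\u00edtcoin now"))
```

The table only grows when a new lookalike code point shows up, which is rare. The word list passed to mentions() is the part that needs constant attention.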

> I would think that too high of a percentage of Unicode when bog standard 
> ASCII / ANSI would suffice would be an indication in and of itself.  I'm not 
> seeing how legitimate (non-spam) email would trigger a false positive if the 
> percentage was tuned correctly.

The problem there is, that's really strongly biased towards English text. 
Spanish and French, for example, would be mostly ASCII, but they would also 
have a fairly high proportion of accented characters.
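
To make the bias concrete, here is a minimal Python sketch that measures the share of non-ASCII letters in a message. The function name non_ascii_ratio() and the two sample sentences are assumptions invented for this example, not anything from an existing rule.

```python
import unicodedata


def non_ascii_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall outside ASCII."""
    # NFC keeps accented letters as single code points so they count once.
    composed = unicodedata.normalize("NFC", text)
    letters = [ch for ch in composed if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ord(ch) > 127 for ch in letters) / len(letters)


# Made-up sample sentences, both perfectly legitimate text.
english = "Please send the payment tomorrow."
spanish = "\u00bfCu\u00e1ndo llegar\u00e1 el env\u00edo, se\u00f1or Mu\u00f1oz?"

if __name__ == "__main__":
    print(f"English: {non_ascii_ratio(english):.3f}")
    print(f"Spanish: {non_ascii_ratio(spanish):.3f}")
```

The English sample scores zero while the Spanish one has 5 non-ASCII letters out of 30, so any threshold low enough to catch lightly obfuscated English would also hit ordinary Spanish or French mail.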

  John Hardin KA7OHZ              FALaholic #11174     pgpk -a
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
   The problem is when people look at Yahoo, slashdot, or groklaw and
   jump from obvious and correct observations like "Oh my God, this
   place is teeming with utter morons" to incorrect conclusions like
   "there's nothing of value here".        -- Al Petrofsky, in Y! SCOX
  2 days until The 77th anniversary of Pearl Harbor
