spamassassin-users mailing list archives

From "Bill Cole" <sausers-20150...@billmail.scconsult.com>
Subject Re: SA Concepts - plugin for email semantics
Date Tue, 31 May 2016 00:30:48 GMT
On 30 May 2016, at 18:25, Dianne Skoll wrote:

> On Mon, 30 May 2016 17:45:52 -0400
> "Bill Cole" <sausers-20150205@billmail.scconsult.com> wrote:
>
>> So you could have 'sex' and 'meds' and 'watches' tallied up into
>> frequency counts that sum up natural (word) and synthetic (concept)
>> occurrences, not just as incompatible types of input feature but as
>> a conflation of incompatible features.
>
> That is easy to patch by giving "concepts" a separate namespace.  You
> could do that by picking a character that can't be in a normal token
> and using something like: concept*meds, concept*sex, etc. as tokens.

Yes, but I'd still be reluctant to have that namespace directly blended 
with 1-word Bayes because those "concepts" are qualitatively different: 
inherently much more complex in their measurement than words. Robotic 
semantic analysis hasn't reached the point where an unremarkable machine 
can decide whether a message is porn or a discussion of current 
political issues, and I would not hazard a guess as to which actual 
concept in email is more likely to be spam or ham these days. By 
contrast, any old mail server can of course tell whether the word 
'Carolina' is present in a message, and that word probably distributes 
quite disproportionately towards ham.
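
In case it helps make the objection concrete, here's a rough Python 
sketch (untested, names made up) of what a namespaced concept token 
might look like next to ordinary word tokens, treating a "concept" as a 
bag of trigger words purely for illustration, which is exactly the 
oversimplification I'm grousing about above:

    import re

    # Made-up concept lists; a real plugin would load these from rules.
    CONCEPTS = {
        "meds": {"viagra", "cialis", "pharmacy", "pills"},
        "sex":  {"porn", "xxx", "adult"},
    }

    def tokens_with_concepts(body):
        """Yield word tokens plus namespaced concept*NAME tokens.

        '*' can't appear in a normal word token, so the two namespaces
        can share one Bayes database without colliding.
        """
        words = re.findall(r"[a-z0-9']+", body.lower())
        for w in words:
            yield w                          # natural (word) token
        for name, vocab in CONCEPTS.items():
            if any(w in vocab for w in words):
                yield "concept*" + name      # synthetic (concept) token

    # list(tokens_with_concepts("Cheap pills and watches!!"))
    #   -> ['cheap', 'pills', 'and', 'watches', 'concept*meds']

That keeps the namespaces from colliding, but it does nothing about the 
qualitative mismatch: the word tokens are observations, the concept 
tokens are somebody's judgment calls.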

>> FWIW, I have roughly no free time for anything between work and
>> family demands but if I did, I would most like to build a blind
>> fixed-length tokenization Bayes classifier: just slice up a message
>> into all of its n-byte sequences (so that a message of bytelength x
>> would have x-(n-1) different tokens) and use those as inputs instead
>> of words.
>
> I think that could be very effective with (as you said) plenty of
> training.  I think there *may* be slight justification for
> canonicalizing text parts into utf-8 first; while you are losing
> information, it's hard to see how 手机色情 should be treated
> differently depending on the character encoding.

Well, I've not thought it through deeply, but an evasion of the charset 
issue might be to just decode any Base64 or QP transfer encoding (which 
can be path-dependent rather than a function of the sender or content) 
to get 8-bit bytes and use 6-byte tokens as if it were all 1-byte chars. 
UCS-4 messages would be a wreck, but a pair of non-ASCII chars in UTF-8 
(3 bytes each for CJK and the like) would be seen cleanly once, plus an 
aura of 10 semi-junk tokens straddling it, in a manner that might 
effectively wash itself out. Or go to 
12-byte tokens and get the same effect with UCS-4. Or 3-byte tokens: 
screw 32-bit charsets, screw encoding semantics of UTF-8, just have 16.8 
million possible 24-bit tokens and see how they distribute. It seems to 
me that this is almost the ultimate test for Naive Bayes text analysis: 
break away from the idea that the input features have any innate meaning 
at all, let them be pure proxies for whatever complex larger patterns 
give rise to them.
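
Since I'll probably never get to write it, here's roughly what I have 
in mind, as an untested Python toy (SA's Bayes is Perl, and the 
smoothing and scoring below are just the generic textbook versions, not 
anything SA actually does):

    import email
    import math
    from collections import Counter

    def decoded_body(raw_message_bytes):
        """Undo Base64/QP transfer encoding, but leave the charset alone."""
        msg = email.message_from_bytes(raw_message_bytes)
        parts = (p for p in msg.walk() if not p.is_multipart())
        return b"".join(p.get_payload(decode=True) or b"" for p in parts)

    def byte_tokens(raw, n=3):
        """All n-byte sliding-window tokens of a decoded body.

        A body of bytelength x yields x-(n-1) tokens; with n=3 there
        are at most 2**24 (~16.8 million) distinct token values.
        """
        return [raw[i:i+n] for i in range(len(raw) - n + 1)]

    class ByteBayes:
        """Minimal Naive Bayes over blind byte tokens."""
        def __init__(self, n=3):
            self.n = n
            self.counts = {"spam": Counter(), "ham": Counter()}
            self.totals = {"spam": 0, "ham": 0}

        def train(self, raw_message_bytes, label):
            toks = byte_tokens(decoded_body(raw_message_bytes), self.n)
            self.counts[label].update(toks)
            self.totals[label] += len(toks)

        def score(self, raw_message_bytes):
            """Log-likelihood ratio: > 0 leans spam, < 0 leans ham."""
            s = 0.0
            for t in byte_tokens(decoded_body(raw_message_bytes), self.n):
                p_spam = (self.counts["spam"][t] + 1) / (self.totals["spam"] + 2)
                p_ham = (self.counts["ham"][t] + 1) / (self.totals["ham"] + 2)
                s += math.log(p_spam / p_ham)
            return s

The point of the exercise is the paragraph above: the 24-bit tokens 
mean nothing by themselves, and the classifier either finds structure 
in how they distribute or it doesn't.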

Oh, and did I mention that Bayes' Theorem has different 
"interpretations" in the same way Heisenberg's Uncertainty Principle and 
quantum superposition do? 24-bit tokens could settle the dispute...
