drill-dev mailing list archives

From paul-rogers <...@git.apache.org>
Subject [GitHub] drill issue #936: DRILL-5772: Add unit tests to indicate how utf-8 support c...
Date Wed, 13 Sep 2017 00:36:06 GMT
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/936
  
    @arina-ielchiieva, thanks for the explanation. Drill's runtime framework assumes that
data is either:
    
    1. ASCII (or, at least, single-byte character set based on ASCII), or
    2. UTF-8 (data is converted to/from UTF-8 when converting VarChar to a Java String.)
    
    Because Drill's code seems to assume ASCII (when it cares about character format),
one could claim that Drill does not have an encoding: it just treats characters as bytes.
But operations such as determining string length and doing pattern matching must be aware
of the character set, if only to know which bytes are continuations of a multi-byte character.
(That is, a three-byte sequence in UTF-8 might be one, two or three characters, depending
on how many of the bytes are continuation bytes.)
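    
    A minimal sketch of the continuation-byte point, assuming well-formed UTF-8 input. The
class and method names are made up for illustration and are not Drill code. It counts
characters (code points) by skipping every byte of the form 10xxxxxx:

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Counts code points in well-formed UTF-8 by skipping continuation
    // bytes, which always have the bit pattern 10xxxxxx.
    public static int countCharacters(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Each sample encodes to exactly three bytes in UTF-8.
        String[] samples = {
            "abc",       // three one-byte characters
            "\u00e9a",   // one two-byte character plus one one-byte character
            "\u20ac"     // one three-byte character
        };
        for (String s : samples) {
            byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(bytes.length + " bytes, "
                + countCharacters(bytes) + " character(s)");
        }
    }
}
```

    All three samples are three bytes long, yet they hold three, two and one characters
respectively, so a byte count alone cannot stand in for string length.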
    
    Now, if the planner assumes ISO-8859-1, but the Drill execution engine assumes UTF-8,
then string constants passed from one to the other can become corrupted in the case where
a particular byte sequence in ISO-8859-1 represents a different character than that same byte
sequence in UTF-8.
    
    Where would this occur? Look at the [ISO-8859-1](https://en.wikipedia.org/wiki/ISO/IEC_8859-1)
definition. ISO-8859-1 is a single-byte character set that assigns characters to the bytes
in the range 0xA0 to 0xFF. But, in UTF-8, a byte with the high bit set is always part of a
multi-byte sequence. So, 0xF7 (the division sign) is a valid single-byte character in
ISO-8859-1, but in UTF-8 it has the form of a lead-in byte and is never a complete character
on its own.
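    
    To make the corruption concrete, here is a small sketch using only the standard Java
charset classes. The class and method names are hypothetical. It encodes a string with one
character set and decodes the same bytes with the other:

```java
import java.nio.charset.StandardCharsets;

class CharsetMismatch {
    // Encode as ISO-8859-1, then (wrongly) decode the same bytes as UTF-8.
    public static String latin1BytesReadAsUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    public static String utf8BytesReadAsLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "a\u00f7b"; // 'a', DIVISION SIGN (0xF7 in ISO-8859-1), 'b'

        // The lone 0xF7 byte is not a complete UTF-8 character, so the
        // decoder cannot give back the division sign.
        System.out.println(latin1BytesReadAsUtf8(original).equals(original)); // false

        // The two UTF-8 bytes of the division sign (0xC3 0xB7) each become
        // a separate ISO-8859-1 character.
        System.out.println(utf8BytesReadAsLatin1("\u00f7").length()); // 2
    }
}
```

    Neither direction throws an exception: the string is silently altered, which is what
makes this kind of mismatch easy to miss.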
    
    The point here is that the character set would seem to be a global setting. If
the Saffron setting is purely for the parser (how to interpret incoming text), and the parser
always produces Java strings in UTF-16 (which are then encoded into UTF-8 for execution),
then we're fine.
    
    But, if text in the parser's encoding is written as bytes and sent to the execution
engine, we're in trouble.
    
    Further, Drill has a web UI. The typical web character set is UTF-8, so queries coming
from the web UI are encoded in UTF-8.
    
    All this suggests one of two things:
    
    1. Drill should always accept UTF-8 (the Saffron property should always be set), or
    2. The property is specified by the client and used to decode the bytes within a Protobuf
message to produce a character stream given to the parser.
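    
    A rough sketch of what the second option could look like, assuming the query text
arrives as raw bytes along with a client-declared character set name. Both the class and
its methods are hypothetical; this is not an existing Drill or Protobuf API:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class QueryTextDecoder {
    // Hypothetical helper: decode the client's bytes with the character
    // set the client declared, giving a Java String for the parser.
    public static String decodeForParser(byte[] queryBytes, String clientCharsetName) {
        Charset clientCharset = Charset.forName(clientCharsetName);
        return new String(queryBytes, clientCharset);
    }

    // Hypothetical helper: whatever the client sent, the execution engine
    // only ever sees UTF-8.
    public static byte[] encodeForEngine(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A client that declares ISO-8859-1 sends the division sign as 0xF7.
        byte[] fromClient = {(byte) 0xF7};
        String parsed = decodeForParser(fromClient, "ISO-8859-1");
        byte[] forEngine = encodeForEngine(parsed);
        System.out.println(forEngine.length); // 2: the UTF-8 form is 0xC3 0xB7
    }
}
```

    The design point is that the client's character set is used exactly once, at the
boundary, and everything past the parser works with a single encoding.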
    
    It appears that UTF-8 is the default Protobuf String type encoding; sender and receiver
would have to agree on another format. Does Drill have such an RPC property? I've not seen
it, but I'm not an expert.
    
    In short, if this change ensures that the parser *always* uses UTF-8, then this is good.
If the character encoding is an option, then we have to consider all the above issues to have
a fully working, end-to-end solution.

