avro-dev mailing list archives

From "Keh-Li Sheng (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AVRO-1190) C++ json parser fails to decode multibyte unicode code points
Date Mon, 05 Nov 2012 18:48:12 GMT
Keh-Li Sheng created AVRO-1190:
----------------------------------

             Summary: C++ json parser fails to decode multibyte unicode code points
                 Key: AVRO-1190
                 URL: https://issues.apache.org/jira/browse/AVRO-1190
             Project: Avro
          Issue Type: Bug
          Components: c++
    Affects Versions: 1.7.0
            Reporter: Keh-Li Sheng


The parser in JsonIO.cc does not decode a multibyte Unicode character into any kind of valid
character encoding for a std::string in C++. The following snippet from JsonParser::tryString()
is flawed for several reasons:

1. sv is a std::string used as a vector, where each unit is a char, so one element cannot hold
a 16-bit value
2. a single Unicode hex quad encoded in JSON can represent a 16-bit value
3. a Unicode hex quad can represent a "high surrogate", meaning that it must be combined
with the following quad to derive the full Unicode code point
4. \U is not a valid Unicode escape for JSON (see http://www.ietf.org/rfc/rfc4627.txt)
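
Points 2 and 3 can be illustrated with a minimal sketch. The helper names below
(isHighSurrogate, isLowSurrogate, combineSurrogates) are hypothetical and are not part of the
Avro C++ library; the arithmetic is the standard UTF-16 surrogate pair formula.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not part of Avro.

// True if u is a UTF-16 high (leading) surrogate, U+D800..U+DBFF.
inline bool isHighSurrogate(std::uint32_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if u is a UTF-16 low (trailing) surrogate, U+DC00..U+DFFF.
inline bool isLowSurrogate(std::uint32_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a high/low surrogate pair into a single code point.
// The caller is assumed to have checked both halves with the predicates above.
inline std::uint32_t combineSurrogates(std::uint32_t high, std::uint32_t low) {
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}
```

For instance, combineSurrogates(0xD83C, 0xDF83) yields 0x1F383, a code point that needs more
than 16 bits and therefore cannot come from a single quad.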

{code:title=JsonIO.cc}
            case 'u':
            case 'U':
                {
                    unsigned int n = 0;
                    char e[4];
                    in_.readBytes(reinterpret_cast<uint8_t*>(e), 4);
                    for (int i = 0; i < 4; i++) {
                        n *= 16;
                        char c = e[i];
                        if (isdigit(c)) {
                            n += c - '0';
                        } else if (c >= 'a' && c <= 'f') {
                            n += c - 'a' + 10;
                        } else if (c >= 'A' && c <= 'F') {
                            n += c - 'A' + 10;
                        } else {
                            throw unexpected(c);
                        }
                    }
                    sv.push_back(n);
                }
{code}

This loop decodes the quad into a temporary unsigned int and then simply pushes that value
(which may need 16 bits) onto the std::string, where it is truncated to a single char. In effect,
the JSON parser does not correctly decode any Unicode escape whose value does not fit in one
char. For example, this JSON string:

{noformat}
"Dress up if you dare! Free cover all night! \uD83C\uDF83\uD83D\uDC7B"
{noformat}

results in the following decoded byte sequence for the last four escapes:

{noformat}
3C 83 3D 7B 00
{noformat}
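
The four non-zero bytes can be reproduced with a small sketch. pushQuadsAsChars is a hypothetical
helper that only mimics the effect of sv.push_back(n); it is not code from JsonIO.cc. It assumes
the usual behavior where narrowing to char keeps the low 8 bits (implementation-defined for a
signed char before C++20, but what common compilers do).

```cpp
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of Avro.
// Mirrors what happens when each decoded quad is pushed onto a std::string:
// the value is narrowed to one char, so only its low 8 bits survive.
inline std::string pushQuadsAsChars(const std::vector<unsigned int>& quads) {
    std::string sv;
    for (unsigned int n : quads) {
        sv.push_back(static_cast<char>(n));  // explicit cast; the original narrows implicitly
    }
    return sv;
}
```

Under that assumption, calling it with {0xD83C, 0xDF83, 0xD83D, 0xDC7B} produces a 4-byte string
holding 3C 83 3D 7B.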

where you can see that it simply drops the high-order byte of each quad. In this particular
example, \uD83C is a high surrogate, which requires some additional handling: it must be combined
with the low surrogate \uDF83 that follows it. I am not sure what users of the C++ library expect
the encoding to be, but given that we are working with JSON, I would assume users would expect
a UTF-8 encoded string. There are many examples of decoders that handle this string properly;
I found this one helpful while implementing a fix: http://rishida.net/tools/conversion/
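
If UTF-8 is indeed the expected output, the encoding step might look like the sketch below.
appendUtf8 is a hypothetical helper, not the fix that was implemented and not part of the Avro
API. It assumes the caller has already turned the escape (or surrogate pair) into a valid code
point no larger than 0x10FFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper for illustration only; not part of Avro.
// Appends the UTF-8 encoding of code point cp to out.
// Assumes cp <= 0x10FFFF and that cp is not itself a lone surrogate.
inline void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}
```

With this sketch, the code point 0x1F383 (the value of the pair \uD83C\uDF83) is appended as the
four bytes F0 9F 8E 83, and 0x1F47B (from \uD83D\uDC7B) as F0 9F 91 BB.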

For the basics of UTF-8, see http://www.utf-8.com/

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
