avro-dev mailing list archives

From "John Karp (JIRA)" <j...@apache.org>
Subject [jira] [Created] (AVRO-1517) Unicode strings are accepted as bytes type by perl API
Date Fri, 23 May 2014 16:04:01 GMT
John Karp created AVRO-1517:

             Summary: Unicode strings are accepted as bytes type by perl API
                 Key: AVRO-1517
                 URL: https://issues.apache.org/jira/browse/AVRO-1517
             Project: Avro
          Issue Type: Bug
          Components: perl
            Reporter: John Karp
            Assignee: John Karp

By default in perl, a string is a sequence of bytes with values 0-255. However, if the string
includes a Unicode character that cannot be represented in a single byte, it gets 'upgraded'
to a non-byte-based Unicode string, which allows ordinals outside that range. When a string
operation combines byte and non-byte Unicode strings, the result is always non-byte, with the
byte string 'upgraded' first. Upgrading consists of utf8 encoding the string and setting a utf8
flag on it. ('utf8' is a variant of UTF-8 used by perl.)
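
The upgrade behavior described above can be observed with the core utf8::is_utf8 function.
This is only a minimal sketch; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $bytes = "\xE9\x01";      # two ordinals in 0-255: stays a byte string
my $wide  = "\x{263A}";      # ordinal above 255: stored upgraded, utf8 flag set
my $mixed = $bytes . $wide;  # the byte string is upgraded before concatenation

for my $pair ([bytes => $bytes], [wide => $wide], [mixed => $mixed]) {
    my ($name, $str) = @$pair;
    # Only the name, length and flag are printed, never the wide character itself.
    printf "%-5s length=%d utf8_flag=%s\n",
        $name, length($str), utf8::is_utf8($str) ? 'on' : 'off';
}
```

The flag is off for the pure byte string and on for both the wide string and the concatenation.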

The perl Avro API currently accepts these Unicode strings as-is for the 'bytes' type. This is
a problem for two reasons: 1) Bytes and Unicode characters are not interchangeable. A user who
declares that they will provide bytes should provide bytes; any encoding is their job. 2) As
Avro assembles the serialized data, perl 'upgrades' all of it, which has the effect of utf8
encoding our serialized binary data.
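
To illustrate point 2, here is a rough sketch of what happens to binary data once a Unicode
string is appended to it. $serialized is a hypothetical stand-in for output Avro has already
produced, not real encoder output, and utf8::encode on a copy is used only to expose the
octets behind the upgraded buffer.

```perl
use strict;
use warnings;

# Hypothetical stand-in for already-serialized binary data (bytes >= 0x80).
my $serialized = "\x80\xFF";

# A caller-supplied 'bytes' value that is really a Unicode string.
my $value = "\x{263A}";

# Appending the Unicode string upgrades the whole buffer.
my $buffer = $serialized . $value;

# Encode a copy in place to see the octets behind the upgraded buffer.
my $octets = $buffer;
utf8::encode($octets);

my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;
print "$hex\n";   # C2 80 C3 BF E2 98 BA
```

The two original bytes 0x80 and 0xFF each turn into a two-byte sequence, so a reader expecting
the original binary layout would see different data.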

The correct behavior is for the Avro perl API to raise an error when it is encoding 'bytes'
and a Unicode string has been provided. (The behavior of 'string' won't change; it will still
take Unicode strings as expected.)
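
One possible shape for such a check is sketched below. check_bytes is a hypothetical helper
written for illustration, not a function in the Avro perl API, and treating "utf8 flag set" as
the rejection rule is an assumption based on the description above.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical helper for illustration; not part of the Avro perl API.
sub check_bytes {
    my ($value) = @_;
    # Assumed rule: any string carrying the utf8 flag is treated as Unicode.
    croak "expected bytes, got a Unicode (utf8-flagged) string"
        if utf8::is_utf8($value);
    return $value;
}

check_bytes("\x00\xFF");                          # passes: plain byte string

my $ok = eval { check_bytes("\x{263A}"); 1 };     # croaks inside the eval
print $ok ? "accepted\n" : "rejected: $@";
```

A looser rule is also conceivable, for example calling utf8::downgrade($copy, 1) on a copy and
rejecting only strings that cannot be downgraded; the report does not say which rule the fix
should use.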

This message was sent by Atlassian JIRA