avro-dev mailing list archives

From "John Karp (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (AVRO-1517) Unicode strings are accepted as bytes type by perl API
Date Wed, 28 May 2014 16:53:01 GMT

     [ https://issues.apache.org/jira/browse/AVRO-1517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Karp updated AVRO-1517:

    Status: Open  (was: Patch Available)

Behavior should be friendlier than what the patch provides: if the string is flagged
is_utf8, try to downgrade it automatically, and only throw an error if it cannot be
downgraded. I will make a new patch.
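
A minimal sketch of the downgrade-then-fail behavior described above. This is not the code from AVRO-1517-0.patch or from any later patch; to_avro_bytes is a hypothetical helper name, and the error message is made up for illustration. It relies only on the core utf8::is_utf8 and utf8::downgrade functions.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the Avro Perl API): returns a
# byte string acceptable for an Avro 'bytes' value, or dies.
sub to_avro_bytes {
    my ($data) = @_;
    if (utf8::is_utf8($data)) {
        # With a true second argument, utf8::downgrade returns false
        # instead of dying when a character does not fit in one byte.
        utf8::downgrade($data, 1)
            or die "cannot encode as bytes: string has characters above 0xFF\n";
    }
    return $data;
}

# Every character fits in one byte, but the utf8 flag has been set.
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);
my $bytes = to_avro_bytes($upgraded);    # downgraded copy, flag off

# U+2603 cannot be represented in one byte, so this call dies.
my $wide = "snowman \x{2603}";
my $ok = eval { to_avro_bytes($wide); 1 };
```

Because the helper works on a copy of its argument, the caller's string keeps whatever internal representation it had.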

> Unicode strings are accepted as bytes type by perl API
> ------------------------------------------------------
>                 Key: AVRO-1517
>                 URL: https://issues.apache.org/jira/browse/AVRO-1517
>             Project: Avro
>          Issue Type: Bug
>          Components: perl
>            Reporter: John Karp
>            Assignee: John Karp
>         Attachments: AVRO-1517-0.patch
> By default in Perl, a string is a sequence of bytes with values 0-255. However, if a Unicode
> character is included that cannot be represented with a single byte, the string gets 'upgraded'
> to a non-byte-based Unicode string, which allows ordinals outside that range. When a string
> operation combines a byte string with a non-byte Unicode string, the result is always non-byte,
> and the byte string is 'upgraded' first. Upgrading consists of utf8 encoding the string and
> setting a utf8 flag on it. ('utf8' is a variant of UTF-8 used by Perl.)
> The Perl Avro API accepts these Unicode strings as-is for the 'bytes' type. This
> is a problem for two reasons: 1) bytes and Unicode characters are not interchangeable, and a
> user who declares that they will provide bytes should provide bytes; any encoding is their
> job. 2) As Avro assembles the serialized data, Perl 'upgrades' all of it, which has the effect
> of utf8 encoding our serialized binary data.
> The correct behavior is for the Avro Perl API to raise an error when it is encoding 'bytes'
> and a Unicode string has been provided. (The behavior of 'string' won't change; it will still
> take Unicode strings as expected.)
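
The upgrade effect described in the quoted report can be reproduced outside Avro with a few lines of Perl. This is only an illustrative sketch: the variable names are made up, and the explicit utf8::encode call stands in for whatever output step finally turns the upgraded string into bytes, which is an assumption about the general mechanism rather than a trace of the Avro encoder.

```perl
use strict;
use warnings;

my $raw  = "\xFF\xFE";     # two bytes of binary data, utf8 flag off
my $text = "\x{263A}";     # one character above 0xFF, utf8 flag on

# Concatenating the two yields a flagged (upgraded) string.
my $mixed = $raw . $text;

# At the Perl level it is still three characters ...
printf "chars: %d, flagged: %s\n",
    length($mixed), utf8::is_utf8($mixed) ? "yes" : "no";

# ... but once it is utf8 encoded for output, 0xFF and 0xFE take two
# bytes each and U+263A takes three, so the binary data is altered.
my $on_wire = $mixed;
utf8::encode($on_wire);
printf "bytes after encoding: %d\n", length($on_wire);
```

The two original bytes come out as four, which is the corruption of serialized binary data that the report describes.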

This message was sent by Atlassian JIRA
