arrow-dev mailing list archives

From "Jochen Ott (JIRA)" <>
Subject [jira] [Created] (ARROW-374) Python: clarify unicode vs. binary in API
Date Wed, 09 Nov 2016 07:27:58 GMT
Jochen Ott created ARROW-374:

             Summary: Python: clarify unicode vs. binary in API
                 Key: ARROW-374
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.1.0
            Reporter: Jochen Ott
            Priority: Minor

pyarrow supports arrow's String type, which arrow represents internally as BINARY with a UTF8 annotation.

In python 2, the pyarrow API accepts both {{unicode}} and binary strings ({{str}}), where the
latter are assumed to be utf-8 encoded. I find this approach problematic, because:
 * there is an implicit assumption that a binary {{str}} contains valid utf-8 data. This assumption
can be wrong, however, and it's not clear what the consequences of passing such "invalid
data" to the API are.
 * the utf-8 assumption is not clearly documented or otherwise visible from the API
 * if pyarrow wants to support pure binary data in the future, a natural choice would be to
use {{str}} as the python2 type. However, this would conflict with the current interpretation
of binary {{str}} as BINARY+UTF8
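
To illustrate the first point, here is a minimal sketch of how a byte string can fail the utf-8 assumption. The helper name {{is_valid_utf8}} is hypothetical and not part of pyarrow; it only shows that the same text encoded as latin-1 is not valid utf-8.

```python
# Hypothetical helper, not part of pyarrow: checks the utf-8 assumption explicitly.
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as utf-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The same text in two encodings; only one of them is valid utf-8.
utf8_bytes = u"caf\xe9".encode("utf-8")
latin1_bytes = u"caf\xe9".encode("latin-1")

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(latin1_bytes))
```

An API that silently treats every binary {{str}} as utf-8 has to decide what happens in the second case; today that behaviour is not specified.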

*Proposed solution*
I propose to change the API so that it only accepts or returns unicode strings, i.e. python2's
{{unicode}} and python3's {{str}}. Passing a python2 {{str}} should raise an exception, and the
same goes for python3's {{bytes}}.
If at some point in the future raw BINARY is also supported, use python3's {{bytes}} and python2's

As a convenience feature for API users, the API may also allow passing utf-8 encoded binary
data as arrow's String, but that should be an explicit, opt-in choice, so that API users are
aware of the (encoding) assumptions made.
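
A rough sketch of what the proposed behaviour could look like is below. The function {{to_arrow_string}} and its {{allow_utf8_bytes}} flag are hypothetical names chosen for illustration, not an existing pyarrow API; the point is only that unicode passes through, binary input is rejected by default, and decoding happens only on explicit request.

```python
import sys

# Text and binary types on python2 and python3.
if sys.version_info[0] >= 3:
    text_type, binary_type = str, bytes
else:
    text_type, binary_type = unicode, str  # noqa: F821 (python2 only)


def to_arrow_string(value, allow_utf8_bytes=False):
    """Hypothetical helper: normalize a value destined for arrow's String type."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        if not allow_utf8_bytes:
            # Default: binary input is an error, so no encoding is assumed silently.
            raise TypeError(
                "expected a unicode string, got binary data; "
                "pass allow_utf8_bytes=True to decode it as utf-8"
            )
        # Opt-in: the caller states that the bytes are utf-8.
        # Invalid data raises UnicodeDecodeError instead of being stored.
        return value.decode("utf-8")
    raise TypeError("expected a string, got %s" % type(value).__name__)
```

With this shape, {{to_arrow_string(u"abc")}} returns the value unchanged, {{to_arrow_string(b"abc")}} raises {{TypeError}}, and {{to_arrow_string(b"abc", allow_utf8_bytes=True)}} decodes explicitly.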

This message was sent by Atlassian JIRA
