arrow-dev mailing list archives

From "Armin Berres (JIRA)" <>
Subject [jira] [Created] (ARROW-3650) [Python] Mixed column indexes are read back as strings
Date Tue, 30 Oct 2018 08:45:00 GMT
Armin Berres created ARROW-3650:

             Summary: [Python] Mixed column indexes are read back as strings 
                 Key: ARROW-3650
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.11.1
            Reporter: Armin Berres

Consider the following example: 

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a string', pd.to_datetime('2018/01/02')])

table = pa.Table.from_pandas(df)
pq.write_table(table, 'test.parquet')

ref_df = pq.read_pandas('test.parquet').to_pandas()

print(df.columns)
# Index(['a string', 2018-01-02 00:00:00], dtype='object')
print(ref_df.columns)
# Index(['a string', '2018-01-02 00:00:00'], dtype='object')

The serialized data frame has a column index containing both a string and a datetime label
(this happened when resetting the index of a formerly datetime-only column).
When reading the file back, the datetime label is converted into a string.

When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": "object"}}
before serializing and {{"pandas_type": "unicode", "numpy_type": "object"}} after
reading back. So the schema was aware of the mixed type but did not store the actual types.
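
The entries quoted above sit in the JSON document stored under the b'pandas' key of the Arrow schema metadata. A minimal sketch for extracting them follows. The helper name column_index_types is hypothetical, and the sample metadata is hand-built from the two entries quoted in this report rather than produced by pyarrow.

```python
import json


def column_index_types(schema_metadata):
    """Return (pandas_type, numpy_type) pairs for the column index levels.

    `schema_metadata` is expected to look like `table.schema.metadata`:
    a mapping whose b'pandas' entry holds JSON-encoded pandas metadata.
    It may also be None or empty, in which case an empty list is returned.
    """
    raw = (schema_metadata or {}).get(b"pandas")
    if raw is None:
        return []
    pandas_meta = json.loads(raw.decode("utf-8"))
    return [
        (level.get("pandas_type"), level.get("numpy_type"))
        for level in pandas_meta.get("column_indexes", [])
    ]


# Hand-built stand-ins mirroring the two entries quoted in this report.
before = {b"pandas": json.dumps(
    {"column_indexes": [{"pandas_type": "mixed", "numpy_type": "object"}]}
).encode("utf-8")}
after = {b"pandas": json.dumps(
    {"column_indexes": [{"pandas_type": "unicode", "numpy_type": "object"}]}
).encode("utf-8")}

print(column_index_types(before))
print(column_index_types(after))
```

With a real table, the call would presumably be column_index_types(table.schema.metadata), once before writing and once on the table read back from the file.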

The same happens with other types like numbers as well. One can produce interesting situations:

{{pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])}} can be written
but fails to be read back, as the column index is no longer unique: '1' shows up twice.
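
The collision can be illustrated without Arrow at all. The sketch below assumes the labels are turned into text with plain str(). That is an assumption about the observable effect, not a description of pyarrow's actual conversion code, and find_label_collisions is a hypothetical helper.

```python
from collections import defaultdict


def find_label_collisions(labels):
    """Group labels whose text form is identical.

    Returns {text: [original labels]} for every text shared by more than
    one label. str() is used as a stand-in for the conversion (assumption).
    """
    by_text = defaultdict(list)
    for label in labels:
        by_text[str(label)].append(label)
    return {text: group for text, group in by_text.items() if len(group) > 1}


print(find_label_collisions(['1', 1]))
print(find_label_collisions(['a string', 2]))
```

Running such a check before writing would flag column sets that are unique in pandas but stop being unique once every label is a string.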

If this is not a bug but expected behavior, maybe the user should somehow be warned that
information is lost? For example with a {{NotImplementedError}} exception.
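
One way such a warning could look on the user side is sketched below. It is not part of pyarrow. check_column_labels is a hypothetical helper that warns when any label is not a str, since those are the labels that come back as strings in the example above.

```python
import warnings


def check_column_labels(labels):
    """Warn if any label is not a str; return the offending labels."""
    non_strings = [label for label in labels if not isinstance(label, str)]
    if non_strings:
        kinds = sorted({type(label).__name__ for label in non_strings})
        warnings.warn(
            "column labels of type " + ", ".join(kinds)
            + " may be read back as strings",
            UserWarning,
            stacklevel=2,
        )
    return non_strings


# Hypothetical usage before writing a frame: check_column_labels(df.columns)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    offending = check_column_labels(['1', 1])

print(offending)
print([str(w.message) for w in caught])
```

A warning keeps the write working for users who accept the conversion. Raising an exception instead would be the stricter variant of the same check.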

This message was sent by Atlassian JIRA
