arrow-dev mailing list archives

From "Wes McKinney (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (ARROW-436) [Python] pandas-parquet roundtrip dtype mismatch
Date Tue, 20 Dec 2016 16:54:58 GMT

     [ https://issues.apache.org/jira/browse/ARROW-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney closed ARROW-436.
------------------------------
    Resolution: Not A Bug

This is a type fidelity issue with the Parquet 1.0 format rather than a bug: because we don't
have width-specific integer logical types there, the uint32 column is read back as int64.
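
As a rough illustration of why the uint32 column in the report below can only come back wider, here is a minimal sketch of the value-range arithmetic. It assumes the writer has to fall back to a plain signed integer type when no width-specific annotation is available; the helper names (signed_range, unsigned_range, fits) are hypothetical and are not part of Arrow or Parquet.

```python
# Illustrative sketch only: this is not Arrow's writer code.
# It checks which signed integer widths can hold every uint32 value.

def signed_range(bits):
    """Return (min, max) for a two's-complement signed integer of `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_range(bits):
    """Return (min, max) for an unsigned integer of `bits` bits."""
    return 0, (1 << bits) - 1


def fits(inner, outer):
    """True if every value in the `inner` range is representable in `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


uint32 = unsigned_range(32)

# A signed 32-bit type cannot hold the upper half of the uint32 range...
print("uint32 fits in int32:", fits(uint32, signed_range(32)))

# ...but a signed 64-bit type holds all of it, at the cost of a different dtype.
print("uint32 fits in int64:", fits(uint32, signed_range(64)))
```

Under that assumption the values survive the roundtrip unchanged and only the dtype differs, which matches the df_read.dtypes output quoted in the issue.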

> [Python] pandas-parquet roundtrip dtype mismatch
> ------------------------------------------------
>
>                 Key: ARROW-436
>                 URL: https://issues.apache.org/jira/browse/ARROW-436
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Wes McKinney
>
> As a follow-up to ARROW-434, I observed the following odd failure:
> {code}
> @parquet
> def test_pandas_parquet_pyfile_failure(tmpdir):
>     filename = tmpdir.join('pandas_pyfile_roundtrip.parquet').strpath
>     size = 5
>     np.random.seed(0)
>     df = pd.DataFrame({
>         'uint8': np.arange(size, dtype=np.uint8),
>         'uint16': np.arange(size, dtype=np.uint16),
>         'uint32': np.arange(size, dtype=np.uint32),
>         'uint64': np.arange(size, dtype=np.uint64),
>         'int8': np.arange(size, dtype=np.int16),
>         'int16': np.arange(size, dtype=np.int16),
>         'int32': np.arange(size, dtype=np.int32),
>         'int64': np.arange(size, dtype=np.int64),
>         'float32': np.arange(size, dtype=np.float32),
>         'float64': np.arange(size, dtype=np.float64),
>         'bool': np.random.randn(size) > 0
>     })
>     arrow_table = A.from_pandas_dataframe(df)
>     with open(filename, 'wb') as f:
>         A.parquet.write_table(arrow_table, f, version="1.0")
>     data = io.BytesIO(open(filename, 'rb').read())
>     table_read = pq.read_table(data)
>     df_read = table_read.to_pandas()
>     pdt.assert_frame_equal(df, df_read)
> {code}
> Debugging locally, I see:
> {code}
> (Pdb) df.dtypes
> bool          bool
> float32    float32
> float64    float64
> int16        int16
> int32        int32
> int64        int64
> int8         int16
> uint16      uint16
> uint32      uint32
> uint64      uint64
> uint8        uint8
> dtype: object
> (Pdb) df_read.dtypes
> bool          bool
> float32    float32
> float64    float64
> int16        int16
> int32        int32
> int64        int64
> int8         int16
> uint16      uint16
> uint32       int64
> uint64      uint64
> uint8        uint8
> dtype: object
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
