arrow-user mailing list archives

From: Wes McKinney <wesmck...@gmail.com>
Subject: Re: bug? pyarrow.Table.from_pydict does not handle binary type correctly with embedded 00 bytes?
Date: Wed, 04 Nov 2020 23:09:37 GMT

Seems a bit buggy, can you open a Jira issue? Thanks

On Wed, Nov 4, 2020 at 5:05 PM Jason Sachs <jmsachs@gmail.com> wrote:
>
> It looks like pyarrow.Table.from_pydict() cuts off binary data after an embedded 00 byte.
> Is this a known bug?
>
> (py3) C:\>python
> Python 3.8.5 (default, Sep  3 2020, 21:29:08) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import numpy as np
> >>> import pyarrow as pa
> >>>
> >>> data = np.array([b'', b'', b'', b'Foo!!', b'Bar!!',
> ...       b'\x00Baz!', b'half\x00baked', b''], dtype='|S13')
> >>> t = pa.Table.from_pydict({'data':data})
> >>> t.to_pandas()
>        data
> 0       b''
> 1       b''
> 2       b''
> 3  b'Foo!!'
> 4  b'Bar!!'
> 5       b''
> 6   b'half'
> 7       b''
> >>> import pandas as pd
> >>> pd.DataFrame(data)
>                   0
> 0               b''
> 1               b''
> 2               b''
> 3          b'Foo!!'
> 4          b'Bar!!'
> 5       b'\x00Baz!'
> 6  b'half\x00baked'
> 7               b''
> >>>
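
The two tables in the quoted session differ in the way one would expect if the fixed-width '|S13' values were read up to the first 00 byte rather than having only their trailing 00 padding stripped. That is a guess at the mechanism, not something confirmed in this thread. The sketch below models both readings in plain Python; pad, read_c_string and read_fixed_width are hypothetical helpers written for illustration and are not pyarrow or NumPy functions.

```python
# Hypothetical helpers, not pyarrow code: they only model two ways of
# reading a fixed-width, NUL-padded field such as NumPy's '|S13'.

WIDTH = 13  # field width taken from dtype='|S13' in the report


def pad(value: bytes, width: int = WIDTH) -> bytes:
    """Store a value the way a fixed-width bytes field does: NUL-padded."""
    return value.ljust(width, b"\x00")


def read_c_string(field: bytes) -> bytes:
    """Stop at the first NUL, as a C-string style reader would."""
    return field.split(b"\x00", 1)[0]


def read_fixed_width(field: bytes) -> bytes:
    """Strip only the trailing NUL padding, keeping embedded NULs."""
    return field.rstrip(b"\x00")


# Same values as in the quoted report.
values = [b"", b"", b"", b"Foo!!", b"Bar!!", b"\x00Baz!", b"half\x00baked", b""]
fields = [pad(v) for v in values]

truncated = [read_c_string(f) for f in fields]
preserved = [read_fixed_width(f) for f in fields]

for original, cut, kept in zip(values, truncated, preserved):
    print(f"{original!r:>18} -> first-NUL: {cut!r:<10} trailing-strip: {kept!r}")
```

Under these assumptions, the first-NUL reading yields b'' for row 5 and b'half' for row 6, matching the t.to_pandas() output, while the trailing-strip reading returns every value unchanged, matching pd.DataFrame(data).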
