arrow-dev mailing list archives

From "Wes McKinney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ARROW-434) Segfaults and encoding issues in Python Parquet reads
Date Tue, 20 Dec 2016 16:34:58 GMT

    [ https://issues.apache.org/jira/browse/ARROW-434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764599#comment-15764599 ]

Wes McKinney commented on ARROW-434:
------------------------------------

PR: https://github.com/apache/arrow/pull/247

Once PARQUET-812 is merged, I'll update the conda-forge artifacts so you can verify the use case
in your environment.

> Segfaults and encoding issues in Python Parquet reads
> -----------------------------------------------------
>
>                 Key: ARROW-434
>                 URL: https://issues.apache.org/jira/browse/ARROW-434
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>         Environment: Ubuntu, Python 3.5, installed pyarrow from conda-forge
>            Reporter: Matthew Rocklin
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: parquet, python
>
> I've installed pyarrow with conda and am trying to read data from the parquet-compatibility
> project.  I haven't explicitly built parquet-cpp or anything, and I may or may not have old versions
> lying around, so please take this issue with a grain of salt:
> {code:python}
> In [1]: import pyarrow.parquet
> In [2]: t = pyarrow.parquet.read_table('nation.plain.parquet')
> ---------------------------------------------------------------------------
> ArrowException                            Traceback (most recent call last)
> <ipython-input-2-5d966681a384> in <module>()
> ----> 1 t = pyarrow.parquet.read_table('nation.plain.parquet')
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.read_table (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2783)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/parquet.pyx in pyarrow.parquet.ParquetReader.read_all (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/parquet.cxx:2200)()
> /home/mrocklin/Software/anaconda/lib/python3.5/site-packages/pyarrow/error.pyx in pyarrow.error.check_status (/feedstock_root/build_artefacts/work/arrow-79344b335849c2eb43954b0751018051814019d6/python/build/temp.linux-x86_64-3.5/error.cxx:1185)()
> ArrowException: NotImplemented: list<: uint8>
> {code}
> Additionally, I tried to read data from a Python file-like object pointing to data on
> S3.  Let me know if you'd prefer a separate issue.
> {code:python}
> In [1]: import s3fs
> In [2]: fs = s3fs.S3FileSystem()
> In [3]: f = fs.open('dask-data/nyc-taxi/2015/parquet/part.0.parquet')
> In [4]: f.read(100)
> Out[4]: b'PAR1\x15\x00\x15\x90\xc4\xa2\x12\x15\x90\xc4\xa2\x12,\x15\xc2\xa8\xa4\x02\x15\x00\x15\x06\x15\x08\x00\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00\x00\x80\xbf\xe7\x8b\x0b\x05\x00@\xc2\xce\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\xc0F\xed\xe7\x8b\x0b\x05\x00\x00\x89\xfc\xe7\x8b\x0b\x05\x00@\xcb\x0b\xe8\x8b\x0b\x05\x00\x80\r\x1b\xe8\x8b\x0b'
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> Here is a more reproducible version:
> {code:python}
> In [1]: with open('nation.plain.parquet', 'rb') as f:
>    ...:     data = f.read()
>    ...:     
> In [2]: from io import BytesIO
> In [3]: f = BytesIO(data)
> In [4]: f.seek(0)
> Out[4]: 0
> In [5]: import pyarrow.parquet
> In [6]: t = pyarrow.parquet.read_table(f)
> Segmentation fault (core dumped)
> {code}
> I was, however, pleased with the round-trip functionality within this project.
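
Editorial sketch, not part of the original report: the file-like reproductions above start by
reading the first bytes of the stream and seeing b'PAR1'. A Parquet file carries that 4-byte
magic at both its start and its end, so a small standard-library check can rule out a truncated
or non-Parquet stream before it is handed to a reader. The helper name looks_like_parquet is
hypothetical, and passing this check does not avoid the segfault described in this issue; it
only narrows down the cause.

```python
import io

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at both the start and the end of a Parquet file


def looks_like_parquet(f):
    """Return True if the seekable binary file-like object `f` starts and ends with PAR1.

    Hypothetical helper for illustration. The caller's stream position is
    restored before returning.
    """
    start = f.tell()
    try:
        f.seek(0, io.SEEK_END)
        size = f.tell()
        # Lower bound: leading magic + 4-byte footer length + trailing magic.
        if size < 12:
            return False
        f.seek(0)
        head = f.read(4)
        f.seek(-4, io.SEEK_END)
        tail = f.read(4)
        return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
    finally:
        f.seek(start)
```

For the BytesIO reproduction, this would be called as looks_like_parquet(f) before
pyarrow.parquet.read_table(f); the same call should work on any seekable binary stream.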



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
