arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker
Date Sat, 06 Mar 2021 04:18:40 GMT
Hi Ruben,
I'm not an expert here, but is it possible the CSV has newlines inside
quotes or some oddity?  There are a lot of configuration options for Read
CSV and you might want to validate that the defaults are at the most
conservative settings.

-Micah

On Fri, Mar 5, 2021 at 12:40 PM Ruben Laguna <ruben.laguna@gmail.com> wrote:

> Hi,
>
> I'm getting "CSV parser got out of sync with chunker", any idea on how to
> troubleshoot this?
> If I feed the original file it fails after 1477218 rows
> if I remove the first line after the header then it fails after 2919443
> rows
> if I remove the first 2 lines after the header  then it fails after 55339
> rows
> if I remove the first 3 lines after the header then it fails after 8200437
> rows
> if I remove the first 4 lines after the header then it fails after 1866573
> rows
> To me it doesn't make sense, the failure shows at different, seemingly random
> places.
>
> What can be causing this?  source code below->
>
>
>
> Traceback (most recent call last):
>   File "pa_inspect.py", line 15, in <module>
>     for b in reader:
>   File "pyarrow/ipc.pxi", line 497, in __iter__
>   File "pyarrow/ipc.pxi", line 531, in pyarrow.lib.RecordBatchReader.read_next_batch
>   File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
> pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker
> in
>
>
> import pyarrow as pa
> from pyarrow import csv
> import pyarrow.parquet as pq
>
> # http://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
> # http://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVStreamingReader.html
> reader = csv.open_csv('inspect.csv')
>
>
> # ParquetWriter: https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
> # RecordBat
> # http://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
> crow = 0
> with pq.ParquetWriter('inspect.parquet', reader.schema) as writer:
>     for b in reader:
>         print(b.num_rows,b.num_columns)
>         crow = crow + b.num_rows
>         print(crow)
>         writer.write_table(pa.Table.from_batches([b]))
>
> --
> /Rubén
>
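As a troubleshooting aid for the question above, one way to check whether the file has newlines inside quoted values is to scan it with Python's standard `csv` module and report records that span more than one physical line. This is only a sketch: the helper name `multiline_records` and its `limit` parameter are made up for illustration, and Python's CSV dialect is not guaranteed to match Arrow's parser, so treat any hits as hints rather than proof.

```python
import csv


def multiline_records(path, limit=10):
    """Return up to `limit` (record_number, first_line_number) pairs for
    records that span more than one physical line.

    Hypothetical helper; record numbers count the header as record 1.
    """
    hits = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        prev_line = 0
        for record_no, _row in enumerate(reader, start=1):
            # reader.line_num counts physical lines consumed so far, so a
            # jump of more than one means the record contained a newline.
            if reader.line_num - prev_line > 1:
                hits.append((record_no, prev_line + 1))
                if len(hits) >= limit:
                    break
            prev_line = reader.line_num
    return hits
```

For example, `multiline_records('inspect.csv')` returning a non-empty list would support the quoted-newline explanation; an empty list suggests looking at other oddities such as stray quote characters or inconsistent delimiters.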
