arrow-user mailing list archives

From Ruben Laguna <ruben.lag...@gmail.com>
Subject pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker
Date Fri, 05 Mar 2021 20:40:23 GMT
Hi,

I'm getting "CSV parser got out of sync with chunker". Any ideas on how to
troubleshoot this?

If I feed the original file, it fails after 1477218 rows.
If I remove the first line after the header, it fails after 2919443 rows.
If I remove the first 2 lines after the header, it fails after 55339 rows.
If I remove the first 3 lines after the header, it fails after 8200437 rows.
If I remove the first 4 lines after the header, it fails after 1866573 rows.

To me it doesn't make sense; the failure shows up at different, seemingly
random places.
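
One thing I plan to try (just a rough sketch, on the assumption that the
shifting failure point is related to where the reader's block boundaries
land; block_size is my guess at the relevant knob, and
ParseOptions(newlines_in_values=True) might also matter if the data has
newlines inside quoted fields) is to vary the block size and see whether the
failure point moves:

import pyarrow.csv as csv

# Sketch: re-run the streaming read with a few different block sizes to see
# whether the failure point moves along with the chunking (block_size is an
# assumption about what matters here, not a known fix).
for block_size in (1 << 20, 4 << 20, 16 << 20):
    read_opts = csv.ReadOptions(block_size=block_size)
    reader = csv.open_csv('inspect.csv', read_options=read_opts)
    rows = 0
    try:
        for batch in reader:
            rows += batch.num_rows
        print(block_size, 'ok, total rows:', rows)
    except Exception as exc:
        print(block_size, 'failed after', rows, 'rows:', exc)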

What could be causing this? The traceback and source code are below.



Traceback (most recent call last):
  File "pa_inspect.py", line 15, in <module>
    for b in reader:
  File "pyarrow/ipc.pxi", line 497, in __iter__
  File "pyarrow/ipc.pxi", line 531, in
pyarrow.lib.RecordBatchReader.read_next_batch
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parser got out of sync with chunker

import pyarrow as pa
from pyarrow import csv
import pyarrow.parquet as pq

# Streaming CSV reader:
# http://arrow.apache.org/docs/python/generated/pyarrow.csv.open_csv.html#pyarrow.csv.open_csv
# http://arrow.apache.org/docs/python/generated/pyarrow.csv.CSVStreamingReader.html
reader = csv.open_csv('inspect.csv')

# ParquetWriter:
# https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
# RecordBatch-level writing:
# http://arrow.apache.org/docs/python/parquet.html#finer-grained-reading-and-writing
crow = 0  # running total of rows read so far
with pq.ParquetWriter('inspect.parquet', reader.schema) as writer:
    for b in reader:  # each iteration yields the next RecordBatch
        print(b.num_rows, b.num_columns)
        crow = crow + b.num_rows
        print(crow)
        writer.write_table(pa.Table.from_batches([b]))
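
For comparison, I may also try the non-streaming path (again just a sketch;
csv.read_csv loads the whole file into memory, so it is only feasible if the
file fits) to check whether the problem is specific to the streaming reader:

import pyarrow.csv as csv
import pyarrow.parquet as pq

# Sketch: read the entire CSV in one shot and write it straight to Parquet,
# bypassing the streaming reader, to see whether the same file converts cleanly.
table = csv.read_csv('inspect.csv')
pq.write_table(table, 'inspect_full.parquet')
print(table.num_rows, 'rows written')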

-- 
/Rubén
