arrow-user mailing list archives

From Wes McKinney <>
Subject Re: Reading Parquet Files in Chunks?
Date Mon, 09 Dec 2019 11:39:52 GMT
There is, but it's not exposed in Python yet.

See the "batch_size" parameter of ArrowReaderProperties and the
GetRecordBatchReader method on parquet::arrow::FileReader. There's some
related work happening in the C++ Datasets project.

I'd like to see batch-based reading refined and better documented in both
C++ and Python; this would be a nice project for a volunteer to take on.

- Wes

On Sun, Dec 8, 2019 at 9:00 PM Zhuo Jia Dai <> wrote:
> For example, pandas's read_csv has a chunksize argument which allows read_csv to
> return an iterator over the CSV file so we can read it in chunks.
> The Parquet format stores the data in chunks, but there isn't a documented way to
> read it in chunks like read_csv.
> Is there a way to read Parquet files in chunks?
> --
> ZJ
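For readers landing on this thread: the iterator contract being asked about (what read_csv's chunksize gives, and what a Python binding for GetRecordBatchReader would eventually provide) can be sketched with nothing but the standard library. The `iter_batches` helper below is purely illustrative, not an Arrow or pandas API; only the names ArrowReaderProperties, batch_size, and GetRecordBatchReader come from the message above.

```python
import csv
import io

def iter_batches(fileobj, batch_size):
    """Yield successive lists of rows ("batches") from a CSV file object.

    Illustrative stand-in for the chunked-reading pattern discussed in the
    thread: the caller gets an iterator and never holds the whole file in
    memory at once.
    """
    reader = csv.reader(fileobj)
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch      # hand back a full batch
            batch = []
    if batch:
        yield batch          # final partial batch, if any

# Usage: 5 rows read with batch_size=2 arrive as batches of 2, 2, and 1.
data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
batches = list(iter_batches(data, batch_size=2))
```

A real Parquet-backed version would wrap record batches from the file's row groups rather than CSV rows, but the consuming loop looks the same either way.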
