arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Reading Parquet Files in Chunks?
Date Mon, 09 Dec 2019 11:39:52 GMT
There is, but it's not exposed in Python yet.

See the "batch_size" parameter of ArrowReaderProperties

https://github.com/apache/arrow/blob/master/cpp/src/parquet/properties.h#L565

and the GetRecordBatchReader method on parquet::arrow::FileReader.
There's also some related work happening in the C++ Datasets project.

I'd like to see batch-based reading refined and better documented in both
C++ and Python; this would be a nice project for a volunteer to
take on.
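For anyone following along, here is a minimal C++ sketch of the batch-based
reading described above, using the ArrowReaderProperties batch_size and
GetRecordBatchReader APIs named in this thread. The file path, batch size, and
row-group index are placeholders, and exact signatures (Status out-parameters
vs. Result returns) vary across Arrow versions:

```cpp
// Sketch: stream a Parquet file as Arrow RecordBatches via
// parquet::arrow::FileReader::GetRecordBatchReader.
// Assumes the Apache Arrow and Parquet C++ libraries are installed;
// path, batch size, and row-group indices below are placeholders.
#include <memory>
#include <string>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/record_batch.h>
#include <arrow/status.h>
#include <parquet/arrow/reader.h>
#include <parquet/properties.h>

arrow::Status ReadParquetInBatches(const std::string& path) {
  // Open the file for random access.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  ARROW_RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &infile));

  // Cap the number of rows materialized per RecordBatch.
  parquet::ArrowReaderProperties props;
  props.set_batch_size(64 * 1024);  // rows per batch (placeholder value)

  std::unique_ptr<parquet::arrow::FileReader> reader;
  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(infile));
  builder.properties(props);
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  // Stream batches from row group 0 (pass every index to cover the file).
  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(reader->GetRecordBatchReader({0}, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process batch->num_rows() rows here ...
  }
  return arrow::Status::OK();
}
```

Each iteration holds only one batch in memory, which is the point of the
batch_size knob: memory use is bounded by the batch size rather than the
size of the whole file or row group.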

- Wes

On Sun, Dec 8, 2019 at 9:00 PM Zhuo Jia Dai <zhuojia.dai@gmail.com> wrote:
>
> For example, pandas's read_csv has a chunksize argument which allows read_csv to
> return an iterator over the CSV file so we can read it in chunks.
>
> The Parquet format stores the data in chunks, but there isn't a documented way to read
> it in chunks like read_csv.
>
> Is there a way to read parquet files in chunks?
>
> --
> ZJ
>
> zhuojia.dai@gmail.com
