arrow-user mailing list archives

From Grant Williams <>
Subject [python] [iter_batches] Is there any value to an iterator based parquet reader in python?
Date Sun, 27 Jun 2021 13:22:51 GMT

I've found myself wondering whether there is a use case for using the
iter_batches method in Python as an iterator, in a style similar to a
server-side cursor in Postgres. Right now you can iterate over record
batches, but I wondered if some sort of Python-native iterator might be
worth adding? Maybe a .to_pyiter() method that converts it to a lazy,
batched iterator of native Python objects?
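For context, here is a minimal sketch of what the current record-batch
pattern looks like with the existing API (the file path and column names
are placeholders, not from anything above):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # placeholder path

# today's pattern: you get RecordBatch objects, which are still columnar,
# and converting to native Python objects is a manual per-column step
for batch in pf.iter_batches(batch_size=1_000, columns=["id", "value"]):
    ids = batch.column(0).to_pylist()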

Here is some example code that achieves a similar result with the existing API.

from itertools import chain
from typing import Any, Iterator, Tuple

def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:

    # stream record batches lazily rather than reading the whole file
    record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)

    # convert from the columnar format of pyarrow arrays to a row
    # format of native python objects (yields tuples)
    yield from chain.from_iterable(
        zip(*(col.to_pylist() for col in batch.columns)) for batch in record_batches
    )
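And a usage sketch for the helper above (again, the path and column names
are just placeholders, assuming a local Parquet file):

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")  # placeholder path

# rows are materialized one batch at a time, never the whole file
for row in iter_parquet(pf, columns=["id", "value"], batch_size=10_000):
    print(row)  # each row is a tuple of native Python objects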

(or a gist if you prefer:

I realize Arrow is a columnar format, but I wonder whether buffered row
reading as a lazy iterator is a common enough use case, given how often
Parquet plus object storage is used as a database alternative.


Grant Williams
Machine Learning Engineer
