I've found myself wondering whether there is a use case for using the iter_batches method in Python as an iterator, in a style similar to a server-side cursor in Postgres. Right now you can iterate over record batches, but I wondered if some sort of Python-native iterator might be worth it? Maybe a .to_pyiter() method that converts the reader into a lazy, batched iterator of native Python objects?

Here is some example code that produces a similar result:
from itertools import chain
from typing import Any, Iterator, Tuple

import pyarrow.parquet as pq


def iter_parquet(parquet_file: pq.ParquetFile, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
    record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)

    # convert from the columnar format of pyarrow arrays to a row format
    # of native python objects (yields one tuple per row)
    yield from chain.from_iterable(
        zip(*(col.to_pylist() for col in batch.columns)) for batch in record_batches
    )
(or a gist if you prefer: https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)

I realize Arrow is a columnar format, but I wonder whether buffered row reading through a lazy iterator is a common enough use case, given how often Parquet plus object storage is now used as a database alternative.


Grant Williams
Machine Learning Engineer