arrow-user mailing list archives

From Grant Williams <gr...@grantwilliams.dev>
Subject [python] [iter_batches] Is there any value to an iterator based parquet reader in python?
Date Sun, 27 Jun 2021 13:22:51 GMT
Hello,

I've found myself wondering whether there is a use case for exposing the
iter_batches method in Python as a row-wise iterator, in a similar style to a
server-side cursor in Postgres. Right now you can iterate over record
batches, but I wondered if some sort of native Python iterator might
be worth it? Maybe a .to_pyiter() method that converts the reader into a
lazy, batched iterator of native Python objects?
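
Something like the following, where to_pyiter is the hypothetical new
method (it does not exist today; everything else is the current API):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")  # hypothetical local file
    for row in pf.to_pyiter(columns=["a", "b"], batch_size=10_000):
        ...  # row would be a tuple of native Python objects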

Here is some example code that produces a similar result with the current API:

from itertools import chain
from typing import Any, Iterator, Tuple

def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:

    record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)

    # Convert from the columnar format of pyarrow arrays to a row format
    # of native Python objects (yields one tuple per row).
    yield from chain.from_iterable(
        zip(*(col.to_pylist() for col in batch.columns))
        for batch in record_batches
    )

(or a gist if you prefer:
https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
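
For completeness, here is how the helper above could be used (assuming a
local file example.parquet with columns a and b):

    import pyarrow.parquet as pq

    pf = pq.ParquetFile("example.parquet")
    for a, b in iter_parquet(pf, columns=["a", "b"], batch_size=10_000):
        print(a, b)  # each row arrives as a native Python tuple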

I realize Arrow is a columnar format, but I wonder whether lazy, buffered
row reading is a common enough use case, given how often parquet + object
storage is used as a database alternative.

Thanks,
Grant

-- 
Grant Williams
Machine Learning Engineer
https://github.com/grantmwilliams/
