arrow-user mailing list archives

From: filippo medri <filippo.me...@gmail.com>
Subject: [pyarrow] How to enable memory mapping in pyarrow.parquet.read_table
Date: Thu, 27 Feb 2020 22:14:18 GMT

Hi,
experimenting with:

import pyarrow as pa
import pyarrow.parquet as pq
table = pq.read_table(source, memory_map=True)
mem_bytes = pa.total_allocated_bytes()

I have observed that mem_bytes is about the size of the parquet file on
disk.
If I remove the assignment and execute

pq.read_table(source, memory_map=True)
mem_bytes = pa.total_allocated_bytes()

then mem_bytes is 0.

Environment: Ubuntu 16, Python 2.7.17, pyarrow 0.16.0 installed with pip.
The parquet file is made by converting four NumPy arrays of doubles to an
Arrow table and then writing it to parquet with the write_table function.
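
For reference, a minimal sketch of how such a file can be produced (the data and the column names below are placeholders, not the actual ones):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Four double-precision NumPy arrays (placeholder data).
arrays = [np.random.rand(1000000) for _ in range(4)]

# Wrap them in an Arrow table with placeholder column names.
table = pa.Table.from_arrays(
    [pa.array(a) for a in arrays],
    names=["c0", "c1", "c2", "c3"],
)

# Write the table to a single parquet file on disk.
pq.write_table(table, "data.parquet")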

My goal is to read the parquet file into a memory-mapped table and then
process it one record batch at a time, with:

batches = table.to_batches()
for batch in batches:
    # do something with the batch, then save it to disk

At present I am able to load a parquet file into an Arrow table, split it
into batches, add columns, and then write each RecordBatch to a parquet
file, but the read_table function seems to load all of the data into
memory.
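
Concretely, the current pipeline looks roughly like this (the added column and the output file names are placeholders for what I actually do):

import pyarrow as pa
import pyarrow.parquet as pq

# read_table appears to pull the whole file into memory here.
table = pq.read_table("data.parquet", memory_map=True)

for i, batch in enumerate(table.to_batches()):
    # Add a derived column (placeholder: just duplicate column 0).
    arrays = batch.columns + [batch.column(0)]
    names = batch.schema.names + ["extra"]
    new_batch = pa.RecordBatch.from_arrays(arrays, names)

    # Write each modified batch out as its own parquet file.
    pq.write_table(pa.Table.from_batches([new_batch]), "batch_%d.parquet" % i)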

Is there a way to load a parquet file into a table one record batch at a
time? Or to stream RecordBatches from a parquet file without loading all
the content into memory?
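
For example, would reading one row group at a time with pq.ParquetFile be the recommended way? A rough sketch of what I have in mind, assuming the file is written with more than one row group:

import pyarrow.parquet as pq

pf = pq.ParquetFile("data.parquet")

# Read one row group at a time instead of the whole file, so that only
# a single row group needs to be resident in memory at once.
for i in range(pf.num_row_groups):
    row_group_table = pf.read_row_group(i)
    for batch in row_group_table.to_batches():
        # do something with the batch, then save it to disk
        pass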

Thanks in advance,
Filippo Medri
