arrow-user mailing list archives

From Jeff Wickstrom <jwickst...@esri.com>
Subject [C++] Writing a large Arrow table to disk
Date Mon, 03 May 2021 18:24:00 GMT
Hello,

I am getting started and prototyping with the Arrow C++ API. Great stuff! I have some newbie
questions.

My first goal is to write a bunch of "records" to an Arrow Table on disk. The resulting file
can be bigger than RAM, so I want a zero-copy approach when it is subsequently consumed
in a Python script, to_pandas(), etc.

My prototype seems to be working. I build the table following the "Row to columnar conversion"
example. Then I open a FileOutputStream, get a RecordBatchWriter from arrow::ipc::NewFileWriter(),
and call WriteTable() on it to save the table to disk. I get the desired zero-copy usage of
it in Python.
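
In outline, the write path of my prototype looks something like this (a simplified sketch,
with the builder code and error handling left out; WriteTableToDisk is just a name for the
illustration):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Write an already-built in-memory table to an Arrow IPC file on disk.
arrow::Status WriteTableToDisk(const std::shared_ptr<arrow::Table>& table,
                               const std::string& path) {
  // Open the destination file.
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));

  // Get a RecordBatchWriter for the IPC file format.
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::NewFileWriter(sink.get(), table->schema()));

  // Write the whole in-memory table and close the file.
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  return writer->Close();
}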

My question is: how do I build that table on disk in a scalable way, say when there are
billions of records? Do the ArrayBuilder classes themselves already manage the business
of spilling over available RAM? Do I need to point up front to a MemoryMappedFile or use a
MemoryManager to make this scale? Maybe I am done already, but it seems like I'll need to
break up the content added to the ArrayBuilder classes while also adding only one chunk
to the ChunkedArray(s) for zero-copy compatibility.
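
To make the question concrete, the incremental approach I am imagining looks roughly like
this (a very rough, untested sketch; MoreRecordsAvailable() and MakeNextBatch() are
placeholders for however my records actually arrive and get pushed through the builders):

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Placeholders for my record source: MoreRecordsAvailable() says whether more
// input rows remain, and MakeNextBatch() fills the ArrayBuilders with the next
// slice of rows and finishes them into a RecordBatch.
bool MoreRecordsAvailable();
arrow::Result<std::shared_ptr<arrow::RecordBatch>> MakeNextBatch(
    const std::shared_ptr<arrow::Schema>& schema);

arrow::Status WriteIncrementally(const std::string& path,
                                 const std::shared_ptr<arrow::Schema>& schema) {
  ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
  ARROW_ASSIGN_OR_RAISE(auto writer,
                        arrow::ipc::NewFileWriter(sink.get(), schema));

  // Build and flush one batch at a time so only that slice of the data
  // has to fit in RAM.
  while (MoreRecordsAvailable()) {
    ARROW_ASSIGN_OR_RAISE(auto batch, MakeNextBatch(schema));
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  }
  return writer->Close();
}

Though I suspect reading that back would give me multiple chunks per column rather than the
single chunk I mentioned above, which is part of what I am unsure about.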

Please help me understand conceptually the best way to write out a bigger-than-RAM Arrow table.
A modified "Row to columnar conversion" example would be great as well, if applicable.

Thank you in advance for your help!

Cheers,
Jeff
