arrow-user mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: [C++] Writing a large Arrow table to disk
Date Wed, 05 May 2021 18:13:25 GMT
>
> Please confirm there is no way around this. If you want a table on disk to
> be zero copy compatible, you must build it from contiguous arrays in
> memory. You can’t build an infinite sized memory mapped, zero copy
> compatible file on disk. Right?
>

For the pandas conversion I believe this is true, but I'm not an expert.  If
you can process the data incrementally, then reading a record batch at a
time and converting it to pandas should still be zero copy.
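
Something like this is what I have in mind on the reading side (untested
sketch; "records.arrow" is just a placeholder for your IPC file):

import pyarrow as pa

# Memory-map the Arrow IPC file and read one record batch at a time
# instead of loading the whole table.
with pa.memory_map("records.arrow", "r") as source:
    reader = pa.ipc.open_file(source)
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)
        # One batch is a single contiguous chunk, so with primitive types
        # and no nulls this conversion can stay zero copy.
        df = batch.to_pandas(zero_copy_only=True)
        # ... process df, then move on to the next batch ...

Since each batch is converted on its own, nothing forces the whole table to
be materialized in memory at once.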

On Wed, May 5, 2021 at 8:46 AM Jeff Wickstrom <jwickstrom@esri.com> wrote:

> Hi Micah, All,
>
>
>
> Thank you for the tip to use WriteRecordBatch. I got that to work. But,
> probably no surprise, it leads to a table on disk that can’t be loaded zero
> copy. “pyarrow.lib.ArrowInvalid: Needed to copy 2 chunks with 0 nulls, but
> zero_copy_only was True”
>
>
>
> Please confirm there is no way around this. If you want a table on disk to
> be zero copy compatible, you must build it from contiguous arrays in
> memory. You can’t build an infinite sized memory mapped, zero copy
> compatible file on disk. Right?
>
>
>
> Does RecordBatchBuilder help with this? I haven’t figured out how to use
> that yet.
>
>
>
> I am using Arrow 1.0.1. Has anything changed the story on this topic in
> more recent releases? Seems we need to upgrade anyway.
>
>
>
> This is how I am using the RecordBatchWriter so far:
>
>
>
>       // write current batch
>       arrow::ArrayDataVector arrayDataVector{};
>       for (auto& builder : arrayBuilders)
>       {
>         std::shared_ptr<arrow::ArrayData> arrayData;
>         // Finish the builder's accumulated contents into one ArrayData chunk.
>         auto finishStatus = builder->FinishInternal(&arrayData);
>         if (!finishStatus.ok())
>           return arrow::Status::ExecutionError("Failed to FinishInternal");
>         arrayDataVector.push_back(arrayData);
>       }
>
>       auto recordBatch = arrow::RecordBatch::Make(
>           schema, totalRecordsAddedForCurrentBatch, arrayDataVector);
>       auto writeRecordBatchStatus =
>           pRecordBatchWriter->WriteRecordBatch(*recordBatch);
>       if (!writeRecordBatchStatus.ok())
>         return arrow::Status::ExecutionError("Failed to WriteRecordBatch");
>
>
>
> Thank you in advance for your insights. Confirming how to best approach
> this is a big help for our design in progress.
>
>
>
> Jeff
>
>
>
> *From:* Micah Kornfield <emkornfield@gmail.com>
> *Sent:* Monday, May 3, 2021 12:10 PM
> *To:* user@arrow.apache.org
> *Subject:* Re: [C++] Writing a large Arrow table to disk
>
>
>
> Hi Jeff,
>
> Maybe the ArrayBuilder() classes themselves are already managing the
> business of spilling over available RAM?
>
> They do not (unless the underlying allocator already does this).  Either
> way there would be a big memcopy/flush when writing to disk.
>
>
>
>
>
> Do I need to point up front to a MemoryMappedFile or use a MemoryManager
> to make this scale?
>
> Per above this could help but also might not.
>
>
>
> Maybe I am done already, but it seems like I’ll need to break up the
> content added to the ArrayBuilder() classes while also only adding one
> chunk to the ChunkedArray(s) for zero copy compatibility.
>
>
>
> Off the top of my head, the best way of doing this, if possible, is to create
> either smaller tables or record batches and write them incrementally (with
> WriteTable or WriteRecordBatch [1]); a rough sketch follows the link below.
>
>
>
> [1] https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc17RecordBatchWriter16WriteRecordBatchERK11RecordBatch
>
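> In pyarrow terms the same pattern looks roughly like this (untested sketch;
> the file name and the produce_batches generator are placeholders, not real
> API or code from this thread):
>
> import pyarrow as pa
>
> schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])
> with pa.OSFile("records.arrow", "wb") as sink:
>     with pa.ipc.new_file(sink, schema) as writer:
>         for batch in produce_batches(schema):  # yields pyarrow.RecordBatch
>             # Append one batch to the IPC file; the full table is never
>             # materialized in memory.
>             writer.write_batch(batch)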
>
>
> On Mon, May 3, 2021 at 11:24 AM Jeff Wickstrom <jwickstrom@esri.com>
> wrote:
>
> Hello,
>
>
>
> I am getting started and prototyping with the Arrow C++ API. Great stuff!
> I have some newbie questions.
>
>
>
> My first goal is to write a bunch of “records” to an ArrowTable on disk.
> The resulting file can be bigger than RAM so I want to use a zero copy
> approach when it is subsequently consumed in a Python script, to_pandas(),
> etc.
>
>
>
> My prototype seems to be working. I build the table following the “Row to
> columnar conversion” example. Then I open a FileOutputStream, get a
> RecordBatchWriter from arrow::ipc::NewFileWriter(), and call WriteTable()
> on it to save the table to disk. I get my desired zero copy usage of it in
> Python.
>
>
>
> My question is how do I build that table on disk in a scalable way? Like
> when there are billions of records? Maybe the ArrayBuilder() classes
> themselves are already managing the business of spilling over available
> RAM? Do I need to point up front to a MemoryMappedFile or use a
> MemoryManager to make this scale? Maybe I am done already, but it seems
> like I’ll need to break up the content added to the ArrayBuilder() classes
> while also only adding one chunk to the ChunkedArray(s) for zero copy
> compatibility.
>
>
>
> Please help me understand conceptually the best way to write out a bigger
> than RAM Arrow table. A modified “Row to columnar conversion” example would
> be great as well if applicable.
>
>
>
> Thank you in advance for your help!
>
>
>
> Cheers,
>
> Jeff
>
>
