arrow-user mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Help with writing Apache Arrow tables to shared memory.
Date Fri, 12 Oct 2018 07:17:10 GMT
hi Bipin,

On Fri, Oct 12, 2018 at 12:09 AM Bipin Mathew <bipinmathew@gmail.com> wrote:
>
> Good Evening Everyone,
>
>     Circling back to this ask. I just wanted to suggest that, instead of an email thread,
it may be more valuable for some of the noobs out there like me to have available an "Apache
Arrow and Shared Memory" "Hello World" example program on the developer wiki, possibly without
the additional complication of Plasma.

Given the current project development / maintenance workload, I doubt
that anyone can change their priorities right now to provide more
user documentation. If you would like to contribute some examples,
that would be very helpful.

Relatedly: I'm actively seeking sponsorship / donations for my
not-for-profit development group Ursa Labs which just works on Apache
Arrow. The more sponsorship we have, the more "free" help we can
provide.

> I managed to get many ancillary features of Apache Arrow working (IPC, for example),
but have not quite closed the circle on the raison d'être for Apache Arrow, which is efficiently
sharing tables and record batches in shared memory. It is not even obvious to me whether it is
possible to construct the tables in shared memory or whether they have to be copied there after
being constructed elsewhere.

If the data is located in shared memory, and you have a metadata
descriptor (as defined in Message.fbs and Schema.fbs) describing the locations
of the component memory buffers, then you can use
arrow::ipc::ReadRecordBatch to do a zero copy "read" into an
arrow::RecordBatch.

How the data gets to shared memory (whether materialized in RAM, then
copied, or materialized directly there) and how the metadata
descriptor is constructed may vary a lot. We have tried to make it
easy for the simple case where you write from RAM and then do
zero-copy read. Beyond that (i.e. if you want to avoid allocating any
RAM) I think you're going to need to dig into the details of the IPC
protocol. The buffers constituting a RecordBatch don't need to be
contiguous, for example.

>
>     I also happened to come across this currently unanswered question on Stack Overflow,
which references an approach I was thinking about (basically, create a shared memory subclass
of MemoryPool), but I was not sure that was the appropriate level of the stack at which to
attack this problem.
>
> https://stackoverflow.com/questions/52673910/allocate-apache-arrow-memory-pool-in-external-memory

This could be possible -- the implementation could end up being rather
complex, though (e.g. I could see the implementation of "Reallocate"
being tricky).
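
To illustrate why "Reallocate" is the awkward part, here is a minimal
sketch of a bump allocator over a fixed-size region. It is not a subclass
of arrow::MemoryPool and does not follow that interface; FixedRegionPool
and its methods are hypothetical names, the region is assumed to be 64-byte
aligned (as an mmap'd region would be), and 64-byte alignment of each block
is a choice made here. Only the most recent block can grow in place; any
other block has to be copied and its old space is lost.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical allocator over a fixed region (for example, a mapped
// shared memory segment). Not the arrow::MemoryPool interface.
class FixedRegionPool {
 public:
  FixedRegionPool(std::uint8_t* base, std::size_t capacity)
      : base_(base), capacity_(capacity) {}

  // Hands out the next 64-byte-aligned block, or nullptr if the region is full.
  std::uint8_t* Allocate(std::size_t size) {
    const std::size_t start = AlignUp(used_);
    if (start > capacity_ || size > capacity_ - start) return nullptr;
    last_ = start;
    used_ = start + size;
    return base_ + start;
  }

  // Grows or shrinks a block. Returns nullptr if the region is exhausted.
  std::uint8_t* Reallocate(std::uint8_t* ptr, std::size_t old_size,
                           std::size_t new_size) {
    assert(ptr >= base_ && ptr < base_ + capacity_);
    const std::size_t start = static_cast<std::size_t>(ptr - base_);
    if (start == last_ && start + old_size == used_) {
      // Easy case: this is the most recent block, so resize it in place.
      if (new_size > capacity_ - start) return nullptr;
      used_ = start + new_size;
      return ptr;
    }
    if (new_size <= old_size) return ptr;  // shrink: the tail is simply wasted
    // Hard case: something was allocated after this block, so the data has
    // to move. The old block cannot be reused by a bump allocator.
    std::uint8_t* fresh = Allocate(new_size);
    if (fresh == nullptr) return nullptr;
    std::memcpy(fresh, ptr, old_size);
    return fresh;
  }

  std::size_t used() const { return used_; }

 private:
  static std::size_t AlignUp(std::size_t n) { return (n + 63) / 64 * 64; }

  std::uint8_t* base_;
  std::size_t capacity_;
  std::size_t used_ = 0;  // bytes consumed from the start of the region
  std::size_t last_ = 0;  // offset of the most recent allocation
};
```

Builders that grow their buffers incrementally hit the copying branch
often, which is one reason sizing the buffers ahead of time tends to be
simpler than a general-purpose shared memory pool.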

The most reliable way to materialize directly into shared memory is to
determine the buffer sizes ahead of time, create a large enough shared
memory page, write data into it (while building your own metadata
descriptor -- probably need to use Flatbuffers directly), then put the
metadata descriptor somewhere. I'm not sure what else the Arrow
project could provide to make this process easier (one thing: we could
provide an API for building your own RecordBatch descriptors without
having to use Flatbuffers directly).
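
The recipe above (size the buffers first, create one region, fill it in
place, keep a descriptor of where each buffer lives) can be sketched with
plain POSIX calls. This is only an illustration under assumptions:
BufferRef, Layout, PlanLayout, MapFile and ViewBuffer are hypothetical
names, the descriptor is a plain struct and not the Flatbuffers metadata
that Arrow's IPC readers expect, and the 64-byte padding is a choice made
here. The path is a parameter; on Linux a path under /dev/shm is backed by
shared memory.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical descriptor entry: where one buffer lives inside the region.
// This is NOT the Arrow IPC metadata format.
struct BufferRef {
  std::uint64_t offset;
  std::uint64_t length;
};

struct Layout {
  std::vector<BufferRef> refs;
  std::uint64_t total = 0;  // size of the region needed to hold every buffer
};

constexpr std::uint64_t kAlignment = 64;  // padding choice for this sketch

inline std::uint64_t PadToAlignment(std::uint64_t n) {
  return (n + kAlignment - 1) / kAlignment * kAlignment;
}

// Step 1: decide every buffer's offset before any memory is created.
Layout PlanLayout(const std::vector<std::uint64_t>& lengths) {
  Layout layout;
  for (std::uint64_t length : lengths) {
    layout.refs.push_back(BufferRef{layout.total, length});
    layout.total += PadToAlignment(length);
  }
  return layout;
}

// Step 2: map the region. With create=true the file is created and sized,
// then mapped read-write; otherwise an existing file is mapped read-only.
// Returns nullptr on failure.
void* MapFile(const char* path, std::uint64_t size, bool create) {
  if (size == 0) return nullptr;
  const int flags = create ? (O_CREAT | O_RDWR | O_TRUNC) : O_RDONLY;
  const int fd = open(path, flags, 0600);
  if (fd < 0) return nullptr;
  if (create && ftruncate(fd, static_cast<off_t>(size)) != 0) {
    close(fd);
    return nullptr;
  }
  const int prot = create ? (PROT_READ | PROT_WRITE) : PROT_READ;
  void* base = mmap(nullptr, size, prot, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after the descriptor is closed
  return base == MAP_FAILED ? nullptr : base;
}

// Step 3 (reader side): a pointer into the mapping; no bytes are copied.
inline const std::uint8_t* ViewBuffer(const void* base, const BufferRef& ref) {
  assert(ref.offset % kAlignment == 0);
  return static_cast<const std::uint8_t*>(base) + ref.offset;
}
```

A producer would write its values through base + refs[i].offset, call
munmap, and hand the descriptor to the consumer by whatever channel is
convenient; the consumer maps the same path read-only and uses ViewBuffer.
For arrow::ipc::ReadRecordBatch to consume the region, the descriptor
would have to be a real IPC message rather than this struct.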

Thanks
Wes

>
> Another approach I was considering is subclassing from ResizableBuffer, but I was not
sure if that is the right method either, since I was not sure if I could construct tables in
shared memory without copying.
>
> Thank you to this great community for all your help in this matter. I am very excited
about this project and its prospects.
>
> Regards,
>
> Bipin
>
>
>
> On Wed, Oct 3, 2018 at 4:37 PM Bipin Mathew <bipinmathew@gmail.com> wrote:
>>
>> Totally understandable. Thank you Wes! We can continue this correspondence there.
Looking forward to the 0.11 release :-)
>>
>> Regards,
>>
>> Bipin
>>
>> On Wed, Oct 3, 2018 at 4:22 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>>>
>>> hi Bipin -- I will reply to your mail on the dev@ mailing list but it
>>> may take me some time. I'm traveling internationally to conferences
>>> and also have been focused on moving the 0.11 release forward.
>>>
>>> - Wes
>>> On Wed, Oct 3, 2018 at 12:00 PM Bipin Mathew <bipinmathew@gmail.com> wrote:
>>> >
>>> > Good Morning Everyone,
>>> >
>>> >     I originally posted this question to the dev channel, not knowing a
user channel was available. This channel is probably more appropriate, and I am hoping
the kind souls here can help me. How, fundamentally, are we expected to copy, or indeed directly
write, an Arrow table to shared memory using the C++ SDK? Currently, I have an implementation
like this:
>>> >
>>> >>  77   std::shared_ptr<arrow::Buffer> B;
>>> >>  78   std::shared_ptr<arrow::io::BufferOutputStream> buffer;
>>> >>  79   std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
>>> >>  80   arrow::MemoryPool* pool = arrow::default_memory_pool();
>>> >>  81   arrow::io::BufferOutputStream::Create(4096,pool,&buffer);
>>> >>  82   std::shared_ptr<arrow::Table> table;
>>> >>  83   karrow::ArrowHandle *h;
>>> >>  84   h = (karrow::ArrowHandle *)Kj(khandle);
>>> >>  85   table = h->table;
>>> >>  86
>>> >>  87   arrow::ipc::RecordBatchStreamWriter::Open(buffer.get(),table->schema(),&writer);
>>> >>  88   writer->WriteTable(*table);
>>> >>  89   writer->Close();
>>> >>  90   buffer->Finish(&B);
>>> >>  91
>>> >>  92   // printf("Investigate Memory usage.");
>>> >>  93   // getchar();
>>> >>  94
>>> >>  95
>>> >>  96   std::shared_ptr<arrow::io::MemoryMappedFile> mm;
>>> >>  97   arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table",B->size(),&mm);
>>> >>  98   mm->Write(B->data(),B->size());
>>> >>  99   mm->Close();
>>> >
>>> >
>>> > "table" on line 85 is a shared_ptr to an arrow::Table object. As you can
see there, I write to an arrow::Buffer, then write that to a memory mapped file. Is there a
more direct approach? I watched this video of a talk @Wes McKinney gave here:
>>> >
>>> > https://www.dremio.com/webinars/arrow-c++-roadmap-and-pandas2/
>>> >
>>> > Where a method: arrow::MemoryMappedBuffer was referenced, but I have not
seen any documentation regarding this function. Has it been deprecated?
>>> >
>>> > Also, as I mentioned, "table" up there is an arrow::Table object. I create
it columnwise using various arrow::[type]Builder functions. Is there any way to actually
write the original table directly into shared memory? Any guidance on the proper way to do
these things would be greatly appreciated.
>>> >
>>> > Regards,
>>> >
>>> > Bipin
