arrow-user mailing list archives

From Bipin Mathew <bipinmat...@gmail.com>
Subject Re: Help with writing Apache Arrow tables to shared memory.
Date Mon, 15 Oct 2018 21:38:17 GMT
Good Evening Wes,

     Thank you for the pointers on how all the parts hang together. I will
go through the documentation you referenced. I will be glad to contribute
any "Hello World" programs I write for myself along the way for the
developer wiki.

Regards,

Bipin



On Fri, Oct 12, 2018 at 3:17 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Bipin,
>
> On Fri, Oct 12, 2018 at 12:09 AM Bipin Mathew <bipinmathew@gmail.com>
> wrote:
> >
> > Good Evening Everyone,
> >
> >     Circling back to this ask. I just wanted to suggest that, instead of
> an email thread, it may be more valuable for some of the noobs out there
> like me to have an "Apache Arrow and Shared Memory" "Hello World" example
> program available on the developer wiki, possibly without the additional
> complication of Plasma.
>
> Given the current project development / maintenance workload, I doubt
> that anyone can change their priorities right now to provide more
> user documentation. If you would like to contribute some examples,
> that would be very helpful.
>
> Relatedly: I'm actively seeking sponsorship / donations for my
> not-for-profit development group Ursa Labs, which works solely on Apache
> Arrow. The more sponsorship we have, the more "free" help we can
> provide.
>
> > I managed to get many ancillary features of Apache Arrow working (IPC,
> for example), but have not quite closed the circle on the raison d'etre
> for Apache Arrow, which is efficiently sharing tables and record batches
> in shared memory. It is not even obvious to me whether it is possible to
> construct the tables in shared memory or whether they have to be copied
> there after being constructed elsewhere.
>
> If the data is located in shared memory, and you have a metadata
> descriptor (as defined in Message/Schema.fbs) describing the locations
> of the component memory buffers, then you can use
> arrow::ipc::ReadRecordBatch to do a zero copy "read" into an
> arrow::RecordBatch.
>
> How the data gets to shared memory (whether materialized in RAM, then
> copied, or materialized directly there) and how the metadata
> descriptor is constructed may vary a lot. We have tried to make it
> easy for the simple case where you write from RAM and then do
> zero-copy read. Beyond that (i.e. if you want to avoid allocating any
> RAM) I think you're going to need to dig into the details of the IPC
> protocol. The buffers constituting a RecordBatch don't need to be
> contiguous, for example.
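
A minimal sketch of the simple case described here: write the IPC stream to a
file under /dev/shm (as in the snippet later in this thread), then memory-map
that file in the reading process and read it back without copying. This
assumes the status-returning C++ API of the ~0.11 era; the path is the one
from the snippet below, and exact signatures may differ in other releases.

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/reader.h>

    // Map the shared-memory-backed file and read the first record batch.
    // MemoryMappedFile reads return slices of the mapped region, so the
    // batch's buffers should reference shared memory rather than copies.
    arrow::Status ReadFromSharedMemory(std::shared_ptr<arrow::RecordBatch>* out) {
      std::shared_ptr<arrow::io::MemoryMappedFile> mm;
      ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Open(
          "/dev/shm/arrow_table", arrow::io::FileMode::READ, &mm));

      std::shared_ptr<arrow::RecordBatchReader> reader;
      ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamReader::Open(mm, &reader));
      return reader->ReadNext(out);
    }
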
>
> >
> >     I also happened to come across this, currently unanswered, question
> on Stack Overflow, which references an approach I was thinking about
> (basically, create a shared-memory subclass of MemoryPool), but was not
> sure that was the appropriate level of the stack at which to attack this
> problem.
> >
> >
> https://stackoverflow.com/questions/52673910/allocate-apache-arrow-memory-pool-in-external-memory
>
> This could be possible -- the implementation could end up being rather
> complex, though (e.g. I could see the implementation of "Reallocate"
> being tricky).
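
For what it's worth, here is a rough sketch of what such a shared-memory
MemoryPool might look like, written against the virtual methods of the
~0.11 C++ API (Allocate / Reallocate / Free / bytes_allocated). The
bump-allocator logic over an already-mapped region is purely illustrative,
not thread-safe, and its Reallocate has to allocate-and-copy, which is
exactly the tricky part mentioned above.

    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <arrow/memory_pool.h>
    #include <arrow/status.h>

    // Hypothetical bump allocator over a pre-mapped shared memory region.
    class ShmMemoryPool : public arrow::MemoryPool {
     public:
      ShmMemoryPool(uint8_t* base, int64_t capacity)
          : base_(base), capacity_(capacity), offset_(0), bytes_allocated_(0) {}

      arrow::Status Allocate(int64_t size, uint8_t** out) override {
        // Round up to 64 bytes to preserve Arrow's buffer alignment.
        int64_t aligned = (size + 63) & ~int64_t(63);
        if (offset_ + aligned > capacity_) {
          return arrow::Status::OutOfMemory("shared memory region exhausted");
        }
        *out = base_ + offset_;
        offset_ += aligned;
        bytes_allocated_ += size;
        return arrow::Status::OK();
      }

      arrow::Status Reallocate(int64_t old_size, int64_t new_size,
                               uint8_t** ptr) override {
        // A bump allocator cannot grow a block in place, so fall back to
        // allocate-and-copy -- this is where a real implementation gets hairy.
        uint8_t* new_ptr = nullptr;
        ARROW_RETURN_NOT_OK(Allocate(new_size, &new_ptr));
        std::memcpy(new_ptr, *ptr,
                    static_cast<size_t>(std::min(old_size, new_size)));
        Free(*ptr, old_size);
        *ptr = new_ptr;
        return arrow::Status::OK();
      }

      void Free(uint8_t* /*buffer*/, int64_t size) override {
        // Space is never reclaimed in this sketch; only the counter moves.
        bytes_allocated_ -= size;
      }

      int64_t bytes_allocated() const override { return bytes_allocated_; }

     private:
      uint8_t* base_;
      int64_t capacity_;
      int64_t offset_;
      int64_t bytes_allocated_;
    };

Passing such a pool to the array builders would place the data buffers in the
mapped region, but the IPC metadata descriptor would still have to be built
and placed separately, as described below.
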
>
> The most reliable way to materialize directly into shared memory is to
> determine the buffer sizes ahead of time, create a large enough shared
> memory page, write data into it (while building your own metadata
> descriptor -- probably need to use Flatbuffers directly), then put the
> metadata descriptor somewhere. I'm not sure what else the Arrow
> project could provide to make this process easier (one thing: we could
> provide an API for building your own RecordBatch descriptors without
> having to use Flatbuffers directly).
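
A sketch of that "size it first, then write in place" approach using the
existing IPC helpers rather than hand-built Flatbuffers, again assuming the
~0.11 status-returning API (GetRecordBatchSize and WriteRecordBatch existed
in that era, though signatures may have changed since); the path is
illustrative only.

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/writer.h>

    // Compute the serialized IPC size of the batch, create a shared-memory
    // file of exactly that size, and write the encapsulated message
    // (metadata header followed by the body buffers) into it.
    arrow::Status WriteBatchToSharedMemory(const arrow::RecordBatch& batch) {
      int64_t ipc_size = 0;
      ARROW_RETURN_NOT_OK(arrow::ipc::GetRecordBatchSize(batch, &ipc_size));

      std::shared_ptr<arrow::io::MemoryMappedFile> mm;
      ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Create(
          "/dev/shm/arrow_batch", ipc_size, &mm));

      int32_t metadata_length = 0;
      int64_t body_length = 0;
      return arrow::ipc::WriteRecordBatch(batch, /*buffer_start_offset=*/0,
                                          mm.get(), &metadata_length,
                                          &body_length,
                                          arrow::default_memory_pool());
    }
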
>
> Thanks
> Wes
>
> >
> > Another approach I was considering is subclassing from ResizableBuffer,
> but was not sure if that is the right method either, since I was not sure
> whether I could construct tables in shared memory without copying.
> >
> > Thank you to this great community for all your help in this matter. I am
> very excited about this project and its prospects.
> >
> > Regards,
> >
> > Bipin
> >
> >
> >
> > On Wed, Oct 3, 2018 at 4:37 PM Bipin Mathew <bipinmathew@gmail.com>
> wrote:
> >>
> >> Totally understandable. Thank you Wes! We can continue this
> correspondence there. Looking forward to the 0.11 release :-)
> >>
> >> Regards,
> >>
> >> Bipin
> >>
> >> On Wed, Oct 3, 2018 at 4:22 PM Wes McKinney <wesmckinn@gmail.com>
> wrote:
> >>>
> >>> hi Bipin -- I will reply to your mail on the dev@ mailing list but it
> >>> may take me some time. I'm traveling internationally to conferences
> >>> and also have been focused on moving the 0.11 release forward.
> >>>
> >>> - Wes
> >>> On Wed, Oct 3, 2018 at 12:00 PM Bipin Mathew <bipinmathew@gmail.com>
> wrote:
> >>> >
> >>> > Good Morning Everyone,
> >>> >
> >>> >     I originally posted this question to the dev channel, not
> knowing a user channel was available. This channel is probably more
> appropriate, and I am hoping the kind souls here can help me. How,
> fundamentally, are we expected to copy, or indeed directly write, an Arrow
> table to shared memory using the C++ SDK? Currently, I have an
> implementation like this:
> >>> >
> >>> >>  77   std::shared_ptr<arrow::Buffer> B;
> >>> >>  78   std::shared_ptr<arrow::io::BufferOutputStream> buffer;
> >>> >>  79   std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
> >>> >>  80   arrow::MemoryPool* pool = arrow::default_memory_pool();
> >>> >>  81   arrow::io::BufferOutputStream::Create(4096,pool,&buffer);
> >>> >>  82   std::shared_ptr<arrow::Table> table;
> >>> >>  83   karrow::ArrowHandle *h;
> >>> >>  84   h = (karrow::ArrowHandle *)Kj(khandle);
> >>> >>  85   table = h->table;
> >>> >>  86
> >>> >>  87   arrow::ipc::RecordBatchStreamWriter::Open(buffer.get(),table->schema(),&writer);
> >>> >>  88   writer->WriteTable(*table);
> >>> >>  89   writer->Close();
> >>> >>  90   buffer->Finish(&B);
> >>> >>  91
> >>> >>  92   // printf("Investigate Memory usage.");
> >>> >>  93   // getchar();
> >>> >>  94
> >>> >>  95
> >>> >>  96   std::shared_ptr<arrow::io::MemoryMappedFile> mm;
> >>> >>  97   arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table",B->size(),&mm);
> >>> >>  98   mm->Write(B->data(),B->size());
> >>> >>  99   mm->Close();
> >>> >
> >>> >
> >>> > "table" on line 85 is a shared_ptr to a arrow::Table object. As you
> can see there, I write to an arrow:Buffer then write that to a memory
> mapped file. Is there a more direct approach? I watched this video of a
> talk @Wes McKinney gave here:
> >>> >
> >>> > https://www.dremio.com/webinars/arrow-c++-roadmap-and-pandas2/
> >>> >
> >>> > In that talk, a method arrow::MemoryMappedBuffer was referenced, but I
> have not seen any documentation for it. Has it been deprecated?
> >>> >
> >>> > Also, as I mentioned, "table" up there is an arrow::Table object. I
> create it column-wise using various arrow::[type]Builder functions. Is
> there any way to write the original table directly into shared memory? Any
> guidance on the proper way to do these things would be greatly appreciated.
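
Following up on the replies above: one direct simplification of the snippet
is to skip the intermediate BufferOutputStream and open the stream writer on
the memory-mapped file itself, so the IPC stream lands in /dev/shm in a
single pass. This is only a sketch against the same ~0.11 API used in the
snippet; the size passed to Create is a caller-supplied upper bound, which is
where the sizing advice above comes in.

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/writer.h>

    // Write the table's IPC stream directly into the memory-mapped file,
    // avoiding the intermediate in-RAM arrow::Buffer from the snippet above.
    arrow::Status WriteTableToSharedMemory(
        const std::shared_ptr<arrow::Table>& table, int64_t max_size) {
      std::shared_ptr<arrow::io::MemoryMappedFile> mm;
      ARROW_RETURN_NOT_OK(arrow::io::MemoryMappedFile::Create(
          "/dev/shm/arrow_table", max_size, &mm));

      std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
      ARROW_RETURN_NOT_OK(arrow::ipc::RecordBatchStreamWriter::Open(
          mm.get(), table->schema(), &writer));
      ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
      return writer->Close();
    }
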
> >>> >
> >>> > Regards,
> >>> >
> >>> > Bipin
>
