From: Bipin Mathew
Date: Mon, 15 Oct 2018 17:38:17 -0400
Subject: Re: Help with writing Apache Arrow tables to shared memory.
To: user@arrow.apache.org

Good Evening Wes,

     Thank you for the pointers on how all the parts hang together. I will
go through the documentation you referenced. I will be glad to contribute
any "Hello World" programs I write for myself along the way for the
developer wiki.

Regards,

Bipin

On Fri, Oct 12, 2018 at 3:17 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Bipin,
>
> On Fri, Oct 12, 2018 at 12:09 AM Bipin Mathew <bipinmathew@gmail.com> wrote:
> >
> > Good Evening Everyone,
> >
> >      Circling back to this ask. I just wanted to suggest that, instead of
> > an email thread, it may be more valuable for some of the noobs out there
> > like me to have available an "Apache Arrow and Shared Memory" "Hello
> > World" example program on the developer wiki, possibly without the
> > additional complication of Plasma.
>
> Given the current project development / maintenance workload, I doubt
> that anyone can change their priorities right now to provide more
> user documentation. If you would like to contribute some examples,
> that would be very helpful.
>
> Relatedly: I'm actively seeking sponsorship / donations for my
> not-for-profit development group Ursa Labs which just works on Apache
> Arrow. The more sponsorship we have, the more "free" help we can
> provide.
>
> > I managed to get many ancillary features of Apache Arrow working (IPC,
> > for example), but have not quite closed the circle on the raison d'etre
> > for Apache Arrow, which is efficiently sharing tables and record batches
> > in shared memory. It is not even obvious to me whether it is possible to
> > construct the tables in shared memory or if they have to be copied there
> > after being constructed elsewhere.
>
> If the data is located in shared memory, and you have a metadata
> descriptor (as defined in Message/Schema.fbs) describing the locations
> of the component memory buffers, then you can use
> arrow::ipc::ReadRecordBatch to do a zero copy "read" into an
> arrow::RecordBatch.
>
> How the data gets to shared memory (whether materialized in RAM, then
> copied, or materialized directly there) and how the metadata
> descriptor is constructed may vary a lot. We have tried to make it
> easy for the simple case where you write from RAM and then do
> zero-copy read. Beyond that (i.e. if you want to avoid allocating any
> RAM) I think you're going to need to dig into the details of the IPC
> protocol. The buffers constituting a RecordBatch don't need to be
> contiguous, for example.
>
> >
> >      I also happened to come across this currently unanswered question
> > on Stack Overflow, which references an approach I was thinking about
> > (basically, create a shared memory subclass for MemoryPool), but was not
> > sure that was the appropriate level of the stack at which to attack this
> > problem.
> >
> > https://stackoverflow.com/questions/52673910/allocate-apache-arrow-memory-pool-in-external-memory
>
> This could be possible -- the implementation could end up being rather
> complex, though (e.g. I could see the implementation of "Reallocate"
> being tricky).
>
> The most reliable way to materialize directly into shared memory is to
> determine the buffer sizes ahead of time, create a large enough shared
> memory page, write data into it (while building your own metadata
> descriptor -- probably need to use Flatbuffers directly), then put the
> metadata descriptor somewhere. I'm not sure what else the Arrow
> project could provide to make this process easier (one thing: we could
> provide an API for building your own RecordBatch descriptors without
> having to use Flatbuffers directly)
>
> Thanks
> Wes
>
> >
> > Another approach I was considering is subclassing from ResizableBuffer,
> > but was not sure if that is the right method either since I was not sure
> > if I could construct tables in shared memory without copying.
> >
> > Thank you to this great community for all your help in this matter. I am
> > very excited about this project and its prospects.
> >
> > Regards,
> >
> > Bipin
> >
> >
> >
> > On Wed, Oct 3, 2018 at 4:37 PM Bipin Mathew <bipinmathew@gmail.com> wrote:
> >>
> >> Totally understandable. Thank you Wes! We can continue this
> >> correspondence there. Looking forward to the 0.11 release :-)
> >>
> >> Regards,
> >>
> >> Bipin
> >>
> >> On Wed, Oct 3, 2018 at 4:22 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> >>>
> >>> hi Bipin -- I will reply to your mail on the dev@ mailing list but it
> >>> may take me some time. I'm traveling internationally to conferences
> >>> and also have been focused on moving the 0.11 release forward.
> >>>
> >>> - Wes
> >>> On Wed, Oct 3, 2018 at 12:00 PM Bipin Mathew <bipinmathew@gmail.com> wrote:
> >>> >
> >>> > Good Morning Everyone,
> >>> >
> >>> >      I originally posted this question to the dev channel, not
> >>> > knowing a user channel was available. This channel is probably more
> >>> > appropriate and I am hoping the kind souls here can help me. How,
> >>> > fundamentally, are we expected to copy, or indeed directly write, an
> >>> > arrow table to shared memory using the cpp sdk? Currently, I have an
> >>> > implementation like this:
> >>> >
> >>> >> 77   std::shared_ptr<arrow::Buffer> B;
> >>> >> 78   std::shared_ptr<arrow::io::BufferOutputStream> buffer;
> >>> >> 79   std::shared_ptr<arrow::ipc::RecordBatchWriter> writer;
> >>> >> 80   arrow::MemoryPool* pool = arrow::default_memory_pool();
> >>> >> 81   arrow::io::BufferOutputStream::Create(4096,pool,&buffer);
> >>> >> 82   std::shared_ptr<arrow::Table> table;
> >>> >> 83   karrow::ArrowHandle *h;
> >>> >> 84   h = (karrow::ArrowHandle *)Kj(khandle);
> >>> >> 85   table = h->table;
> >>> >> 86
> >>> >> 87   arrow::ipc::RecordBatchStreamWriter::Open(buffer.get(),table->schema(),&writer);
> >>> >> 88   writer->WriteTable(*table);
> >>> >> 89   writer->Close();
> >>> >> 90   buffer->Finish(&B);
> >>> >> 91
> >>> >> 92   // printf("Investigate Memory usage.");
> >>> >> 93   // getchar();
> >>> >> 94
> >>> >> 95
> >>> >> 96   std::shared_ptr<arrow::io::MemoryMappedFile> mm;
> >>> >> 97   arrow::io::MemoryMappedFile::Create("/dev/shm/arrow_table",B->size(),&mm);
> >>> >> 98   mm->Write(B->data(),B->size());
> >>> >> 99   mm->Close();
> >>> >
> >>> >
> >>> > "table" on line 85 is a shared_ptr to an arrow::Table object. As you
> >>> > can see there, I write to an arrow::Buffer then write that to a memory
> >>> > mapped file. Is there a more direct approach? I watched this video of a
> >>> > talk @Wes McKinney gave here:
> >>> >
> >>> > https://www.dremio.com/webinars/arrow-c++-roadmap-and-pandas2/
> >>> >
> >>> > Where a method: arrow::MemoryMappedBuffer was referenced, but I have
> >>> > not seen any documentation regarding this function. Has it been
> >>> > deprecated?
> >>> >
> >>> > Also, as I mentioned, "table" up there is an arrow::Table object. I
> >>> > create it columnwise using various arrow::[type]Builder functions. Is
> >>> > there any way to actually even write the original table directly into
> >>> > shared memory? Any guidance on the proper way to do these things would
> >>> > be greatly appreciated.
> >>> >
> >>> > Regards,
> >>> >
> >>> > Bipin
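
Wes's recipe in the quoted reply (determine the buffer sizes ahead of time, create a large enough shared memory region, write the data into it, then put the descriptor somewhere) can be sketched without any Arrow dependency. What follows is only a minimal illustration, assuming a POSIX system. The names ColumnDescriptor, PlanLayout, WriteColumn, ColumnView, OpenColumn and CloseColumn are hypothetical, and the fixed-size header is a stand-in for the Flatbuffers metadata, not the Arrow IPC format. The writer generates one int64 column directly inside the mapping; the reader gets pointers into the mapping without copying.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical descriptor stored at offset 0 of the region. It plays the role
// of Arrow's Flatbuffers metadata but is NOT the Arrow IPC format.
struct ColumnDescriptor {
  std::uint64_t num_values;
  std::uint64_t validity_offset;  // byte offset of the validity bitmap
  std::uint64_t validity_length;  // byte length of the validity bitmap
  std::uint64_t values_offset;    // byte offset of the int64 values
  std::uint64_t values_length;    // byte length of the int64 values
};

// Round n up to a multiple of 8 so every buffer starts 8-byte aligned.
inline std::uint64_t PadTo8(std::uint64_t n) { return (n + 7) & ~std::uint64_t{7}; }

// Step 1: decide the layout (and so the total size) before touching memory.
inline ColumnDescriptor PlanLayout(std::uint64_t n) {
  ColumnDescriptor d{};
  d.num_values = n;
  d.validity_offset = PadTo8(sizeof(ColumnDescriptor));
  d.validity_length = (n + 7) / 8;
  d.values_offset = d.validity_offset + PadTo8(d.validity_length);
  d.values_length = n * sizeof(std::int64_t);
  return d;
}

// Steps 2-4: create the region, generate the column inside the mapping, and
// store the descriptor. On Linux, a path under /dev/shm is shared memory.
inline bool WriteColumn(const char* path, std::uint64_t n) {
  const ColumnDescriptor d = PlanLayout(n);
  const std::size_t total = static_cast<std::size_t>(d.values_offset + d.values_length);
  int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0) return false;
  if (ftruncate(fd, static_cast<off_t>(total)) != 0) { close(fd); return false; }
  void* base = mmap(nullptr, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after the file descriptor is closed
  if (base == MAP_FAILED) return false;
  auto* bytes = static_cast<std::uint8_t*>(base);
  std::memset(bytes + d.validity_offset, 0xFF, d.validity_length);  // all valid
  auto* values = reinterpret_cast<std::int64_t*>(bytes + d.values_offset);
  for (std::uint64_t i = 0; i < n; ++i) values[i] = static_cast<std::int64_t>(i) * 10;
  std::memcpy(bytes, &d, sizeof d);  // the descriptor lives at the region head
  return munmap(base, total) == 0;
}

// Read-only view whose pointers refer straight into the mapping (no copy).
struct ColumnView {
  void* base;
  std::size_t mapped_size;
  std::uint64_t length;
  const std::uint8_t* validity;
  const std::int64_t* values;
};

inline bool OpenColumn(const char* path, ColumnView* out) {
  assert(out != nullptr);
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  struct stat st;
  if (fstat(fd, &st) != 0 || static_cast<std::size_t>(st.st_size) < sizeof(ColumnDescriptor)) {
    close(fd);
    return false;
  }
  const std::size_t size = static_cast<std::size_t>(st.st_size);
  void* base = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
  close(fd);
  if (base == MAP_FAILED) return false;
  ColumnDescriptor d;
  std::memcpy(&d, base, sizeof d);
  // Reject descriptors that point outside the mapped region.
  if (d.validity_offset > size || d.validity_length > size - d.validity_offset ||
      d.values_offset > size || d.values_length > size - d.values_offset) {
    munmap(base, size);
    return false;
  }
  const auto* bytes = static_cast<const std::uint8_t*>(base);
  out->base = base;
  out->mapped_size = size;
  out->length = d.num_values;
  out->validity = bytes + d.validity_offset;
  out->values = reinterpret_cast<const std::int64_t*>(bytes + d.values_offset);
  return true;
}

inline void CloseColumn(ColumnView* v) {
  if (v->base != nullptr) munmap(v->base, v->mapped_size);
  *v = ColumnView{};
}
```

A writer process might call WriteColumn("/dev/shm/arrow_column", n) and a reader OpenColumn on the same path, after which view.values[i] reads directly from the mapping. To hand such a region to Arrow itself, the header would have to be a real IPC message that arrow::ipc::ReadRecordBatch understands, which this sketch does not attempt.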
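
Wes's remark that "Reallocate" could be tricky for a shared memory pool can be made concrete with a toy allocator. The class below is a hypothetical sketch: FixedRegionPool and its methods are illustrative names loosely shaped like an allocate / reallocate / free interface. It does not derive from arrow::MemoryPool and returns bool instead of an Arrow status. It hands out slices of a fixed, caller-provided region, which could be a shared memory mapping, and it assumes the region's base address is itself suitably aligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical bump allocator over a fixed region of memory.
class FixedRegionPool {
 public:
  FixedRegionPool(std::uint8_t* base, std::int64_t capacity)
      : base_(base), capacity_(capacity), used_(0) {
    assert(base != nullptr && capacity >= 0);
  }

  // Hand out the next slice at a 64-byte-aligned offset, or fail when full.
  bool Allocate(std::int64_t size, std::uint8_t** out) {
    const std::int64_t start = AlignUp(used_);
    if (size < 0 || start > capacity_ || size > capacity_ - start) return false;
    *out = base_ + start;
    used_ = start + size;
    return true;
  }

  // Resize in place only when *ptr is the most recent allocation. Otherwise a
  // new slice is allocated and the bytes are copied; the old slice is lost,
  // and any offset another process holds to it becomes stale.
  bool Reallocate(std::int64_t old_size, std::int64_t new_size, std::uint8_t** ptr) {
    const std::int64_t start = *ptr - base_;
    if (start + old_size == used_) {
      if (new_size < 0 || new_size > capacity_ - start) return false;
      used_ = start + new_size;
      return true;
    }
    std::uint8_t* fresh = nullptr;
    if (!Allocate(new_size, &fresh)) return false;
    const std::int64_t keep = old_size < new_size ? old_size : new_size;
    std::memcpy(fresh, *ptr, static_cast<std::size_t>(keep));
    *ptr = fresh;
    return true;
  }

  // A bump allocator can only reclaim the most recent block.
  void Free(std::uint8_t* ptr, std::int64_t size) {
    if (ptr - base_ + size == used_) used_ = ptr - base_;
  }

  std::int64_t bytes_used() const { return used_; }

 private:
  static std::int64_t AlignUp(std::int64_t n) { return (n + 63) & ~std::int64_t{63}; }

  std::uint8_t* base_;
  std::int64_t capacity_;
  std::int64_t used_;
};
```

Growing the last block is cheap, but growing any earlier block moves the data and strands the old bytes, and a fixed region cannot grow to absorb the waste. That is one plausible reading of why sizing the buffers up front is described as the more reliable route.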