arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: Question about mutability
Date Thu, 25 Feb 2016 19:35:19 GMT
One thing to keep in mind is that shared memory is not a performance panacea.

We did some experimentation (actually, an intern on our team did --
credit where credit is due) with shared memory transport between the
Kudu C++ client and server. What we found was that, for single-batch
transfers, shared memory was no faster than unix domain sockets (which
were measurably but not a lot faster than TCP sockets). The issue is
that, when a new process mmaps a shared memory segment, it still has
to take minor page faults to set up page table entries on every page
it hits.

Shared memory with huge pages is one potential solution, but relies on
either a fixed hugepage allocation (pain to configure) or hoping that
background THP compaction is keeping up. Additionally, tmpfs-based
shared memory doesn't support huge pages in current kernel versions
AFAIK, so you're stuck with sysv shared memory, which is a bit more
painful to use from Java.

If you're able to reuse a shared memory region across multiple data
transfers, the improvements are substantial, though, so it's not a
dead end. It's just most useful for the case where you're streaming a
bunch of *batches* of data (eg 8M each) between two processes, and can
have some way of coordinating between them.

Another difficulty is security -- with a format like Arrow that embeds
internal pointers, any process following the pointers needs to either
trust the other processes with write access to the segment, or needs
to carefully verify any pointers before following them. Otherwise,
it's pretty easy to crash the reader if you're a malicious writer. In
the case of using arrow from a shared daemon (eg impalad) to access an
untrusted UDF (eg python running in a setuid task container), this is
a real concern IMO.

-Todd



On Thu, Feb 25, 2016 at 8:37 AM, Wes McKinney <wes@cloudera.com> wrote:
> hi Leif -- you've articulated almost exactly my vision for pandas
> interoperability with Spark via Arrow. There are some questions to sort
> out, like shared memory / mmap management and security / sandboxing
> questions, but in general moving toward a model where RPC's contain shared
> memory offsets to Arrow data rather than shipping large chunks of
> serialized data through stdin/stdout would be a huge leap forward.
>
> On Wed, Feb 24, 2016 at 4:26 PM, Leif Walsh <leif.walsh@gmail.com> wrote:
>
>> Here's the image that popped into my mind when I heard about this project,
>> at best it's a motivating example, at worst it's a distraction:
>>
>> 1. Spark reads parquet from wherever into an arrow structure in shared
>> memory.
>> 2. Spark executor calls into the Python half of pyspark with a handle to
>> this memory.
>> 3. Pandas computes some summary over this data in place, produces some
>> smaller result set, also in memory, also in arrow format.
>> 4. Spark driver collects results (without ser/de overheads, just straight
>> byte buffer reads across the network), concatenates them in memory.
>> 5. Spark driver hands that memory to local pandas, which does more
>> computation over that in place.
>>
>> This is what got me excited, it kills a lot of the spark overheads related
>> to data (and for a kicker, also enables full pandas compatibility along
>> with statsmodels/scikit/etc in step 3). I suppose if this is not a vision
>> the arrow secs have, I'd be interested in how. Not looking for any promises
>> yet though.
>> On Wed, Feb 24, 2016 at 14:52 Corey Nolet <cjnolet@apache.org> wrote:
>>
>> > So far, how does the integration with the Spark project look? Do you
>> > envision cached Spark partitions allocating Arrows? I could imagine this
>> > would be absolutely huge for being able to ask questions of real-time
>> data
>> > sets across applications.
>> >
>> > On Wed, Feb 24, 2016 at 2:49 PM, Zhe Zhang <zhz@apache.org> wrote:
>> >
>> > > Thanks for the insights Jacques. Interesting to learn the thoughts on
>> > > zero-copy sharing.
>> > >
>> > > mmap allows sharing address spaces via filesystem interface. That has
>> > some
>> > > security concerns but with the immutability restrictions (as you
>> > clarified
>> > > here), it sounds feasible. What other options do you have in mind?
>> > >
>> > > On Wed, Feb 24, 2016 at 11:30 AM Corey Nolet <cjnolet@gmail.com>
>> wrote:
>> > >
>> > > > Agreed,
>> > > >
>> > > > I thought the whole purpose was to share the memory space (using
>> > possibly
>> > > > unsafe operations like ByteBuffers) so that it could be directly
>> shared
>> > > > without copy. My interest in this is to have it enable fully
>> in-memory
>> > > > computation. Not just "processing" as in Spark, but as a fully
>> > in-memory
>> > > > datastore that one application can expose and share with others (e.g.
>> > In
>> > > > memory structure is constructed from a series of parquet files,
>> > somehow,
>> > > > then Spark pulls it in, does some computations, exposes a data set,
>> > > > etc...).
>> > > >
>> > > > If you are leaving the allocation of the memory to the applications
>> and
>> > > > underneath the memory is being allocated using direct bytebuffers,
I
>> > > can't
>> > > > see exactly why the problem is fundamentally hard- especially if the
>> > > > applications themselves are worried about exposing their own memory
>> > > spaces.
>> > > >
>> > > >
>> > > > On Wed, Feb 24, 2016 at 2:17 PM, Andrew Brust <
>> > > > andrew.brust@bluebadgeinsights.com> wrote:
>> > > >
>> > > > > Hmm...that's not exactly how Jaques described things to me when
he
>> > > > briefed
>> > > > > me on Arrow ahead of the announcement.
>> > > > >
>> > > > > -----Original Message-----
>> > > > > From: Zhe Zhang [mailto:zhz@apache.org]
>> > > > > Sent: Wednesday, February 24, 2016 2:08 PM
>> > > > > To: dev@arrow.apache.org
>> > > > > Subject: Re: Question about mutability
>> > > > >
>> > > > > I don't think one application/process's memory space will be
made
>> > > > > available to other applications/processes. It's fundamentally
hard
>> > for
>> > > > > processes to share their address spaces.
>> > > > >
>> > > > > IIUC, with Arrow, when application A shares data with application
>> B,
>> > > the
>> > > > > data is still duplicated in the memory spaces of A and B. It's
just
>> > > that
>> > > > > data serialization/deserialization are much faster with Arrow
>> > (compared
>> > > > > with Protobuf).
>> > > > >
>> > > > > On Wed, Feb 24, 2016 at 10:40 AM Corey Nolet <cjnolet@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > Forgive me if this question seems ill-informed. I just started
>> > > looking
>> > > > > > at Arrow yesterday. I looked around the github a tad.
>> > > > > >
>> > > > > > Are you expecting the memory space held by one application
to be
>> > > > > > mutable by that application and made available to all
>> applications
>> > > > > > trying to read the memory space?
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> --
>> --
>> Cheers,
>> Leif
>>



-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
View raw message