arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Reynold Xin <r...@databricks.com>
Subject Re: Understanding "shared" memory implications
Date Wed, 16 Mar 2016 17:07:04 GMT
I always thought Arrow was just an in-memory format, and it is the
responsibility of whoever else that want to use it to carry that
responsibilities out, because depending on workloads, different frameworks
might pick very different applications. Otherwise it seems to be doing too
much and having too strong of an opinion about data sharing in a format
that's primarily about data sharing.

On Wed, Mar 16, 2016 at 10:03 AM, Corey Nolet <cjnolet@gmail.com> wrote:

> I've been under the impression that exposing memory to be shared directly
> and not copied WAS, in fact, the responsibility of Arrow. In fact, I read
> this in [1] and this is turned me on to Arrow in the first place.
>
>
> [1]
>
> http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/
>
> On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jhyde@apache.org> wrote:
>
> > This is all very interesting stuff, but just so we’re clear: it is not
> > Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor
> facilities
> > for resource management. If we DID decide to make this Arrow’s
> > responsibility it would overlap with other components which specialize in
> > such stuff.
> >
> >
> >
> > > On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <jacques@apache.org>
> wrote:
> > >
> > > @Todd: agree entirely on prototyping design. My goal is throw out some
> > > ideas and some POC code and then we can explore from there.
> > >
> > > My main thoughts have initially been around lifecycle management. I've
> > done
> > > some work previously where a consistently sized shared buffer using
> mmap
> > > has improved performance. This is more complicated given the
> requirements
> > > for providing collaborative allocation and cross process reference
> > counts.
> > >
> > > With regards to whether this is more generally applicable: I think it
> > could
> > > ultimately be more general but I suggest we focus on the particular
> > > application of moving long-lived arrow record batches between a
> producer
> > > and a consumer initially. Constraining the problems seems like we will
> > get
> > > to something workable sooner. We can abstract to a more general
> solution
> > as
> > > there are other clear requirements.
> > >
> > > With regards to capnproto, I believe they are simply saying when they
> > talk
> > > about zero-copy shared memory that the structure supports that (same as
> > any
> > > memory-layout based design). I don't believe they actually implemented
> a
> > > protocol and multi-language implementation for zero-copy cross process
> > > communication.
> > >
> > > One other note to make here is that my goal here is not just about
> > > performance but also about memory footprint. Being able to have a
> shared
> > > memory protocol that allows multiple tools to interact with the same
> hot
> > > dataset.
> > >
> > > RE: ACL, for the initial focus, I suggest that we consider the two
> > sharing
> > > processes are "trusted" and expect the initial Arrow API reference
> > > implementations to manage memory access.
> > >
> > > Regarding other questions that Todd threw out:
> > >
> > > - if you are using an mmapped file in /dev/shm/, how do you make sure
> it
> > > gets cleaned up if the process crashes?
> > >
> > >>> Agreed that it needs to get resolve. If I recall, destruction can be
> > > applied once associated process are attached to memory and this allows
> > the
> > > kernel to recover once all attaching processes are destroyed. If this
> > isn't
> > > enough, then we may very well need a simple  external coordinator.
> > >
> > > - how do you allocate memory to it? there's nothing ensuring that
> > /dev/shm
> > > doesn't swap out if you try to put too much in there, and then your
> > > in-memory super-fast access will basically collapse under swap
> thrashing
> > >
> > >>> Simplest model initially is probably one where we assume a master
> and a
> > > slave. (Ideally negotiated on initial connection.) The master is
> > > responsible for allocating memory and giving that to the slave. The
> > master
> > > then is responsible for managing reasonable memory allocation limits
> just
> > > like any other. Slaves that need to allocated memory must ask the
> master
> > > (at whatever chunk makes sense) and will get rejected if they are too
> > > aggressive. (this probably means that at any point an IPC can fall back
> > to
> > > RPC??)
> > >
> > > - how do you do lifecycle management across the two processes? If, say,
> > > Kudu wants to pass a block of data to some Python program, how does it
> > know
> > > when the Python program is done reading it and it should be deleted?
> What
> > > if the python program crashed in the middle - when can Kudu release it?
> > >
> > >>> My thinking, as mentioned earlier, is a shared reference count model
> > for
> > > complex situations. Possibly a "request/response" ownership model for
> > > simpler cases.
> > >
> > > - how do you do security? If both sides of the connection don't trust
> > each
> > > other, and use length prefixes and offsets, you have to be constantly
> > > validating and re-validating everything you read.
> > >
> > > I'm suggesting that we start with trusting so we don't get too wrapped
> up
> > > in all the extra complexities of security. My experience with these
> > things
> > > is that a lot of users will frequently pick performance or footprint
> over
> > > security for quite some time. For example, if I recall correctly, on
> the
> > > shared file descriptor model that was initially implemented in the HDFS
> > > client, that people used short-circuit reads for years before security
> > was
> > > correctly implemented. (Am I remembering this right?)
> > >
> > > Lastly, as I mentioned above, I don't think there should be any
> > requirement
> > > that Arrow communication be limited to only 'IPC'. As Todd points out,
> in
> > > many cases unix domain sockets will be just fine.
> > >
> > > We need to implement both models because we all know that locality will
> > > never be guaranteed. The IPC design/implementation needs to be good for
> > > anything to make into arrow.
> > >
> > > thanks
> > > Jacques
> > >
> > >
> > >
> > > On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zhz@apache.org> wrote:
> > >
> > >> I have similar concerns as Todd stated below. With an mmap-based
> > approach,
> > >> we are treating shared memory objects like files. This brings in all
> > >> filesystem related considerations like ACL and lifecycle mgmt.
> > >>
> > >> Stepping back a little, the shared-memory work isn't really specific
> to
> > >> Arrow. A few questions related to this:
> > >> 1) Has the topic been discussed in the context of protobuf (or other
> IPC
> > >> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
> > >> zero-copy
> > >> shared memory. I haven't read implementation detail though.
> > >> 2) If the shared-memory work benefits a wide range of protocols,
> should
> > it
> > >> be a generalized and standalone library?
> > >>
> > >> Thanks,
> > >> Zhe
> > >>
> > >> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <todd@cloudera.com>
> wrote:
> > >>
> > >>> Having thought about this quite a bit in the past, I think the
> > mechanics
> > >> of
> > >>> how to share memory are by far the easiest part. The much harder part
> > is
> > >>> the resource management and ownership. Questions like:
> > >>>
> > >>> - if you are using an mmapped file in /dev/shm/, how do you make sure
> > it
> > >>> gets cleaned up if the process crashes?
> > >>> - how do you allocate memory to it? there's nothing ensuring that
> > >> /dev/shm
> > >>> doesn't swap out if you try to put too much in there, and then your
> > >>> in-memory super-fast access will basically collapse under swap
> > thrashing
> > >>> - how do you do lifecycle management across the two processes? If,
> say,
> > >>> Kudu wants to pass a block of data to some Python program, how does
> it
> > >> know
> > >>> when the Python program is done reading it and it should be deleted?
> > What
> > >>> if the python program crashed in the middle - when can Kudu release
> it?
> > >>> - how do you do security? If both sides of the connection don't trust
> > >> each
> > >>> other, and use length prefixes and offsets, you have to be constantly
> > >>> validating and re-validating everything you read.
> > >>>
> > >>> Another big factor is that shared memory is not, in my experience,
> > >>> immediately faster than just copying data over a unix domain socket.
> In
> > >>> particular, the first time you read an mmapped file, you'll end up
> > paying
> > >>> minor page fault overhead on every page. This can be improved with
> > >>> HugePages, but huge page mmaps are not supported yet in current Linux
> > >> (work
> > >>> going on currently to address this). So you're left with hugetlbfs,
> > which
> > >>> involves static allocations and much more pain.
> > >>>
> > >>> All the above is a long way to say: let's make sure we do the write
> > >>> prototyping and up-front design before jumping into code.
> > >>>
> > >>> -Todd
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <jacques@apache.org>
> > >>> wrote:
> > >>>
> > >>>> @Corey
> > >>>> The POC Steven and Wes are working on is based on MappedBuffer
but
> I'm
> > >>>> looking at using netty's fork of tcnative to use shared memory
> > >> directly.
> > >>>>
> > >>>> @Yiannis
> > >>>> We need to have both RPC and a shared memory mechanisms (what I'm
> > >>> inclined
> > >>>> to call IPC but is a specific kind of IPC). The idea is we negotiate
> > >> via
> > >>>> RPC and then if we determine shared locality, we work over shared
> > >> memory
> > >>>> (preferably for both data and control). So the system interacting
> with
> > >>>> HBase in your example would be the one responsible for placing
> > >> collocated
> > >>>> execution to take advantage of IPC.
> > >>>>
> > >>>> How do others feel of my redefinition of IPC to mean the same memory
> > >>> space
> > >>>> communication (either via shared memory or rdma) versus RPC as
> socket
> > >>> based
> > >>>> communication?
> > >>>>
> > >>>>
> > >>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cjnolet@gmail.com>
> > >> wrote:
> > >>>>
> > >>>>> I was seeing Netty's unsafe classes being used here, not mapped
> byte
> > >>>>> buffer  not sure if that statement is completely correct but
I'll
> > >> have
> > >>> to
> > >>>>> dog through the code again to figure that out.
> > >>>>>
> > >>>>> The more I was looking at unsafe, it makes sense why that would
be
> > >>>>> used.apparently it's also supposed to be included on Java 9
as a
> > >> first
> > >>>>> class API
> > >>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <wes@cloudera.com>
wrote:
> > >>>>>
> > >>>>>> My understanding is that you can use java.nio.MappedByteBuffer
to
> > >>> work
> > >>>>>> with memory-mapped files as one way to share memory pages
between
> > >>> Java
> > >>>>>> (and non-Java) processes without copying.
> > >>>>>>
> > >>>>>> I am hoping that we can reach a POC of zero-copy Arrow
memory
> > >> sharing
> > >>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed
this will
> > >>> have
> > >>>>>> huge implications once we get it working end to end (for
example,
> > >>>>>> receiving memory from a Java process in Python without
a heavy
> > >> ser-de
> > >>>>>> step -- it's what we've always dreamed of) and with the
metadata
> > >> and
> > >>>>>> shared memory control flow standardized.
> > >>>>>>
> > >>>>>> - Wes
> > >>>>>>
> > >>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cjnolet@gmail.com>
> > >>>> wrote:
> > >>>>>>> If I understand correctly, Arrow is using Netty underneath
which
> > >> is
> > >>>>>> using Sun's Unsafe API in order to allocate direct byte
buffers
> off
> > >>>> heap.
> > >>>>>> It is using Netty to communicate between "client" and "server",
> > >>>>> information
> > >>>>>> about memory addresses for data that is being requested.
> > >>>>>>>
> > >>>>>>> I've never attempted to use the Unsafe API to access
off heap
> > >>> memory
> > >>>>>> that has been allocated in one JVM from another JVM but
I'm
> > >> assuming
> > >>>> this
> > >>>>>> must be the case in order to claim that the memory is being
> > >> accessed
> > >>>>>> directly without being copied, correct?
> > >>>>>>>
> > >>>>>>> The implication here is huge. If the memory is being
directly
> > >>> shared
> > >>>>>> across processes by them being allowed to directly reach
into the
> > >>>> direct
> > >>>>>> byte buffers, that's true shared memory. Otherwise, if
there's
> > >> copies
> > >>>>> going
> > >>>>>> on, it's less appealing.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Thanks.
> > >>>>>>>
> > >>>>>>> Sent from my iPad
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Todd Lipcon
> > >>> Software Engineer, Cloudera
> > >>>
> > >>
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message