arrow-dev mailing list archives

From Jacques Nadeau <jacq...@apache.org>
Subject Re: Question on Exactness of Arrow Memory Layout
Date Thu, 02 Jun 2016 15:40:19 GMT
What we're ultimately targeting is that the flatbuffer pointers Micah noted
in [1] can work across a non-contiguous memory region.

Take a look at [2]. Each of the colored boxes should be contiguous,
but they don't need to be packed together in memory for IPC. Note that the
"data header" in [2] is the flatbuffer described in [1].


[1] https://github.com/apache/arrow/blob/master/format/Message.fbs
[2]
https://docs.google.com/presentation/d/1bB26ZNUq_YDsjXCtIp2UXWJFvN1P3wn_w-yKCzxlC8A/edit#slide=id.p29
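
To illustrate the point about non-contiguous regions, here is a minimal Python sketch. It is not the Arrow format or the flatbuffer schema in [1]: `BufferRef`, `resolve`, and `is_valid` are hypothetical names, and the (region, offset, length) triple is only an assumed stand-in for whatever the metadata actually records. It shows two buffers that are each contiguous but live in separately allocated regions, resolved through a small descriptor without copying.

```python
import struct
from dataclasses import dataclass


@dataclass
class BufferRef:
    """Hypothetical stand-in for one buffer entry in the metadata."""
    region: int  # index of the memory region that holds the buffer
    offset: int  # byte offset of the buffer inside that region
    length: int  # byte length of the buffer


def resolve(regions, ref):
    """Return a zero-copy view of the buffer that `ref` describes."""
    return memoryview(regions[ref.region])[ref.offset:ref.offset + ref.length]


def is_valid(bitmap, i):
    """Read bit i of a validity bitmap (least-significant bit first)."""
    return bool((bitmap[i // 8] >> (i % 8)) & 1)


# Two separately allocated regions. Each buffer is contiguous on its own,
# but the two buffers are not packed next to each other.
validity_region = bytearray([0b00000101])  # slots 0 and 2 are non-null
values_region = bytearray(64)
struct.pack_into("<3i", values_region, 8, 10, 0, 30)  # values start at byte 8

regions = [validity_region, values_region]
refs = {
    "validity": BufferRef(region=0, offset=0, length=1),
    "values": BufferRef(region=1, offset=8, length=12),
}

validity = resolve(regions, refs["validity"])
values = struct.unpack("<3i", resolve(regions, refs["values"]))
```

Because `resolve` returns a `memoryview`, a later write to `values_region` is visible through the view, which is the property a reader of shared memory would rely on.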

On Wed, Jun 1, 2016 at 12:14 PM, Micah Kornfield <emkornfield@gmail.com>
wrote:

> Hi Jacob,
> The current rough prototype/proposal of the IPC via shared memory is
> to do a depth-first traversal of each array's buffers and write them out
> to a contiguous memory block.  Metadata about array types and
> locations of buffers is persisted at the end of the memory block in
> flatbuffer format [1].  Reading it back is a matter of using the
> metadata to create a structure (like the one you have above) that has
> pointers back to the contiguous memory block.  The work-in-progress
> C++ version of this is located at [2].
>
> I hope this helps.
>
> [1] https://github.com/apache/arrow/blob/master/format/Message.fbs
> [2]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/adapter.cc
>
> Cheers,
> Micah
>
> On Wed, Jun 1, 2016 at 10:53 AM, Jacob Quinn <quinn.jacobd@gmail.com>
> wrote:
> > Having become familiar with the Arrow memory layout, and taking a stab at
> > an implementation in the Julia language, I've come up with a perhaps
> > naive question.
> >
> > A "type" (class) I have defined so far is:
> >
> > immutable Column{T} <: ArrowColumn{T}
> >     buffer::Vector{UInt8} # potential reference to mmap
> >     length::Int32
> >     null_count::Int32
> >     nulls::BitVector # null == 0 == false, not-null == 1 == true;
> >                      # always padded to 64-byte alignments
> >     values::Vector{T} # always padded to 64-byte alignments
> > end
> >
> >
> > which aims to be an array/column that holds any "primitive" bits type
> > `T`. Note the exact layout matching with "length", "null_count",
> > "nulls", and "values".
> >
> > The additional reference, however, is the "buffer" field, which holds a
> > reference to a byte buffer. This would be technically optional if the
> > `nulls` and `values` fields owned their own memory, but there are other
> > cases where `buffer` would own, for example, memory-mapped bytes that
> > `nulls` and `values` would be sharing.
> >
> > My question is whether this somehow "violates" the Arrow memory layout
> > by including this additional `buffer` reference in my class.
> >
> > It raises a larger question of what exactly the inter-language "API" looks
> > like. I'm assuming it's not as strict as needing to be able to pass a
> > pointer to another process that would be able to auto-wrap it as its own
> > Arrow structure; but I think I read somewhere that it IS aiming for some
> > kind of "memcpy" operation. Any light anyone can shed would be most
> > welcome; help me know if I'm perhaps over-thinking this at this stage.
> >
> > -Jacob
>
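
The shared-memory scheme Micah describes in the quoted message (depth-first write of buffers into one contiguous block, metadata at the end, read back as pointers into the block) can be sketched roughly as follows. This is a toy in Python, not the code in adapter.cc: JSON stands in for the flatbuffer metadata, the trailing 4-byte length field is an assumption made so the reader can find the metadata, the nested-dict array shape and the names `walk`, `write_block`, and `read_block` are hypothetical, and the 64-byte padding mentioned in the original question is left out for brevity.

```python
import json
import struct


def walk(array, path="root"):
    """Yield (name, bytes) for every buffer, visiting children depth first."""
    for i, buf in enumerate(array["buffers"]):
        yield f"{path}.buf{i}", buf
    for j, child in enumerate(array.get("children", [])):
        yield from walk(child, f"{path}.child{j}")


def write_block(array):
    """Pack all buffers back to back, then append metadata and its length."""
    body = bytearray()
    meta = []
    for name, data in walk(array):
        meta.append({"name": name, "offset": len(body), "length": len(data)})
        body += data
    footer = json.dumps(meta).encode("utf-8")
    # Assumed convention: a trailing little-endian uint32 holds the footer size.
    return bytes(body) + footer + struct.pack("<I", len(footer))


def read_block(block):
    """Return {name: memoryview}; every view points into `block`, no copies."""
    view = memoryview(block)
    (footer_len,) = struct.unpack("<I", view[-4:])
    footer_end = len(view) - 4
    meta = json.loads(bytes(view[footer_end - footer_len:footer_end]))
    return {m["name"]: view[m["offset"]:m["offset"] + m["length"]] for m in meta}


# A parent array with one validity buffer and one child holding int32 values.
array = {
    "buffers": [bytes([0b00000111])],
    "children": [
        {"buffers": [bytes([0b00000101]), struct.pack("<3i", 10, 0, 30)]},
    ],
}

block = write_block(array)
views = read_block(block)
child_values = struct.unpack("<3i", views["root.child0.buf1"])
```

The dictionary returned by `read_block` plays the role of the `nulls` and `values` fields in the Julia type from the original question, and `block` plays the role of its `buffer` field: the views only borrow memory, so something has to keep the underlying block alive. In that sense an extra owning reference is a property of the in-process wrapper rather than of the bytes that are exchanged.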
