arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chao Sun <sunc...@apache.org>
Subject Re: Cross-implementation metadata specification, IPC details
Date Mon, 07 Mar 2016 01:02:04 GMT
Thanks Wes. This makes sense.

Best,
Chao

On Sun, Mar 6, 2016 at 2:01 PM, Wes McKinney <wes@cloudera.com> wrote:

> Chao -- yes, I believe you're right.
>
> In an IPC message, if the node is required or has null_count == 0,
> then the null bitmap can be omitted from the payload, but otherwise
> it's there (this is similar to the Hive HS2 V6 protocol).
>
> One alternative to this is that rather than omitting the bitmap, its
> data header metadata can be included but the buffer size is zero.
>
> - Wes
>
> On Sun, Mar 6, 2016 at 12:33 PM, Chao Sun <sunchao@apache.org> wrote:
> > Hi,
> >
> > Sorry if this is a n00b question. For the example that Jacques used in
> the
> > previous thread, how does it work if the struct is nullable - shouldn't
> > there
> > be a is_null array between 1 and 2?
> >
> > For instance, with the example:
> >
> > list elem #0: { <f0: aaa, f1: 42>, null, <f0: null, f1: 28> }
> > list elem #1: { <f0: bbb, f1: null> }
> > list elem #2: null
> > list elem #3: { <f0: null, f1: null> }
> >
> > what does the encoding look like?
> >
> > I'm thinking: (0 = false, 1 = true in is_null arrays):
> >
> > 0 (list is_null): 0 0 1 0
> > 1 (list offset): 0 3 4 4 5
> > 2 (struct is_null): 0 1 0 0 0
> > 3 (struct field f0 is_null): 0 1 0 1
> > 4 (struct field f0 offset): 0 3 3 6 6
> > 5 (struct field f0 value): a a a b b b
> > 6 (struct field f1 is_null): 1 1 0 0
> > 7 (struct field f1 value): 42 28
> >
> > Let me know if this is wrong. Thanks!
> >
> > Best,
> > Chao
> >
> > On Tue, Mar 1, 2016 at 1:13 PM, Wes McKinney <wes@cloudera.com> wrote:
> >
> >> Inline responses
> >>
> >> On Tue, Mar 1, 2016 at 8:24 AM, Jacques Nadeau <jacques@apache.org>
> wrote:
> >> > Wes, thanks for starting this conversation.
> >> >
> >> > Couple thoughts:
> >> >
> >> > For metadata, we have two models existing (one in the ValueVectors
> >> approach
> >> > and one in Parquet). It seems like we should start from one of those
> and
> >> > then shape as appropriate. It seems like we have a richer physical
> >> > capability that the core Dremel algorithm that Parquet implements so I
> >> > think it would make sense to focus first on the logical model and then
> >> > figure out the shared physical that exists below that.
> >> >
> >> > While the Data Headers item (2) in your description may come logically
> >> > second, I think that it greatly informs 1.B as I believe 2 is
> something
> >> > that should be an in-memory canonical representation (similar to the
> >> > vectors themselves). I know Steven has been looking at moving the Java
> >> > layer over to serialize the data headers using something similar to
> this:
> >> >
> >> > Data headers use a deterministic pre-order "tree" ordering of the
> memory
> >> > buffers (https://en.wikipedia.org/wiki/Tree_traversal). The data
> >> structures
> >> > are simply an array of data headers consisting of a list of buffer
> >> offsets
> >> > and sizes.
> >> >
> >>
> >> This makes sense, and is consistent with Parquet's depth-first
> >> (pre-order) flattening of nested schemas into a flat
> >> list<SchemaElement>
> >> (
> >>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L556
> >> ).
> >>
> >> > For example, consider this schema:
> >> >
> >> > List<Struct<String=List<UInt8>, Int32>>
> >> >
> >> > the pre-order buffer order is
> >> >
> >> > 0: nulls top level list
> >> >
> >> > 1: list offsets
> >> >
> >> > 2: struct field 0 nulls
> >> >
> >> > 3: struct field 0 list offsets
> >> >
> >> > 4: struct field 0 inner UInt8 values
> >> >
> >> > 5: struct field 1 nulls
> >> >
> >> > 6: struct field 1 Int32 values
> >> >
> >>
> >> Regarding the buffers, we would want to embed the null count for each
> >> logical array in some way, so that the null bitmap can be omitted.
> >> This also spares having to compute the actual null counts when
> >> receiving a row batch.
> >>
> >> For example, a List<T>, we could have a couple cases:
> >>
> >> List
> >> - null_count > 0 -> 2 buffers, 1 for the null bitmap, 1 for the list
> >> offsets
> >> - null_count == 0 -> 1 buffer for the list offsets
> >>
> >> Recursively, the buffers for the type T child data would similarly
> >> have its own null count, determining the actual number of memory
> >> buffers. Let me know what you think.
> >>
> >> > The flatbuffer schema for the data header would then be:
> >> >
> >> > namespace DataHeaders;
> >> >
> >> > struct Buffer {
> >> >
> >> >  data: long;
> >> >
> >> >  length: int;
> >> >
> >> > }
> >> >
> >> > // Representing a single array (aka ValueVector), typically
> >> >
> >> > table BufferList {
> >> >
> >> >  // With FBS it is not possible to know the length of an array
> >> >
> >> >  n_buffers: int;
> >> >
> >> >  buffers: [Buffer];
> >> >
> >> > }
> >> >
> >> > // Multiple arrays -- could be used for long arrays or a
> >> >
> >> > // whole table row batch
> >> >
> >> > table ArrayBatch {
> >> >
> >> >  n_arrays: int;
> >> >
> >> >  arrays: [BufferList];
> >> >
> >> > }
> >>
> >> I've been tinkering with flatbuffers, and array lengths are indeed
> >> available in the generated IDL (see flatbuffers::Vector::size in the
> >> C++ API, for example), so the n_buffers / n_arrays fields aren't
> >> needed.
> >>
> >> Per the discussion above re: null counts, something like this may work:
> >>
> >> struct Buffer {
> >>  data: long;
> >>  length: int;
> >> }
> >>
> >> table ArrayNode {
> >>   length: int;
> >>   null_count: int;
> >>
> >>   /// The number of buffers would be inferred from the node type and
> actual
> >>   /// null_count
> >>   buffers: [Buffer];
> >> }
> >>
> >> table RowBatch {
> >>   /// Nodes correspond to the depth-first flattened logical schema
> >>   nodes: [ArrayNode];
> >> }
> >>
> >> So in an example schema:
> >>
> >> row batch schema {
> >>   f0: List<List<UInt8>>,
> >>   f1: Int32
> >> }
> >>
> >> The flattened version of this schema contains 4 ArrayNodes, 3 for the
> >> nested f0 column, and 1 for the flat f1 column.
> >>
> >> - Wes
> >>
> >> >
> >> >
> >> > On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <wes@cloudera.com>
> wrote:
> >> >
> >> >> hello all,
> >> >>
> >> >> I wanted to kick-start the process of coming up with a standardized
/
> >> >> canonical metadata specification that we can use for describing Arrow
> >> >> data to be moved between systems. This breaks down into at least two
> >> >> distinct kinds of metadata
> >> >>
> >> >> 1) "Schemas": physical types, logical types, child array types,
> struct
> >> >> field names, and so forth. Does not contain information about the
> size
> >> >> of the actual physical data (which depends on the length of arrays
> and
> >> >> the sizes of list/variable-length type dimensions).
> >> >>
> >> >> 2) "Data headers": a description of the shape of a physical chunk of
> >> >> data associated with a particular schema. Array length, null count,
> >> >> memory buffer offsets and sizes, etc. This is the information you
> need
> >> >> to compute the right pointers into a shared memory region or IPC/RPC
> >> >> buffer and reconstruct Arrow container classes.
> >> >>
> >> >> Since #2 will depend on some of the details of #1, I suggest we start
> >> >> figuring out #1 first. As far as the type metadata is concerned, to
> >> >> avoid excess bike shedding we should break that problem into:
> >> >>
> >> >> A) The general layout of the type metadata / schemas
> >> >> B) The technology we use for representing the schemas (and data
> >> >> headers) in an implementation-independent way for use in an IPC/RPC
> >> >> setting (and even to "store" ephemeral data on disk)
> >> >>
> >> >> On Item B, from what I've seen with Parquet and such file formats
> with
> >> >> embedded metadata, and in the spirit of Arrow's "deserialize-nothing"
> >> >> ethos, I suggest we explore no-deserialization technologies like
> >> >> Google's Flatbuffers (https://github.com/google/flatbuffers) as a
> more
> >> >> CPU-efficient alternative to Thrift, Protobuf, or Avro. In large
> >> >> schemas, technologies like Thrift can result in significant overhead
> >> >> in "needle-in-haystack" problems where you are picking only a few
> >> >> columns out of very wide tables (> 1000s of columns), and it may
be
> >> >> best to try to avoid this if at all possible.
> >> >>
> >> >> I would like some help stewarding the design process on this from the
> >> >> Arrow PMC and in particular those who have worked on the design and
> >> >> implementation of Parquet and other file formats and systems for
> which
> >> >> Arrow is an immediate intended companion. Lot of things we can learn
> >> >> from those past experiences.
> >> >>
> >> >> Thank you,
> >> >> Wes
> >> >>
> >>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message