arrow-dev mailing list archives

From Jacques Nadeau <jacq...@apache.org>
Subject Re: Cross-implementation metadata specification, IPC details
Date Tue, 01 Mar 2016 16:24:07 GMT
Wes, thanks for starting this conversation.

A couple of thoughts:

For metadata, we have two existing models (one in the ValueVectors approach
and one in Parquet). It seems like we should start from one of those and
then shape it as appropriate. We appear to have a richer physical
capability than the core Dremel algorithm that Parquet implements, so I
think it would make sense to focus first on the logical model and then
figure out the shared physical model that exists below that.

While the Data Headers item (2) in your description may come logically
second, I think that it greatly informs 1.B as I believe 2 is something
that should be an in-memory canonical representation (similar to the
vectors themselves). I know Steven has been looking at moving the Java
layer over to serialize the data headers using something similar to this:

Data headers use a deterministic pre-order "tree" ordering of the memory
buffers (https://en.wikipedia.org/wiki/Tree_traversal). The data structures
are simply an array of data headers consisting of a list of buffer offsets
and sizes.

For example, consider this schema:

List<Struct<String=List<UInt8>, Int32>>

the pre-order buffer order is

0: nulls top level list
1: list offsets
2: struct field 0 nulls
3: struct field 0 list offsets
4: struct field 0 inner UInt8 values
5: struct field 1 nulls
6: struct field 1 Int32 values
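
As a rough illustration of the pre-order numbering above, here is a minimal Python sketch. The `Node` class and `preorder_buffers` function are hypothetical helpers, not part of any Arrow implementation, and the buffers attached to each node are copied from the list above rather than derived from a layout spec.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One node of the type tree (hypothetical helper, not an Arrow API)."""
    buffers: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def preorder_buffers(node: Node) -> Iterator[str]:
    """Yield a node's own buffers first, then each child's, left to right."""
    yield from node.buffers
    for child in node.children:
        yield from preorder_buffers(child)


# List<Struct<String=List<UInt8>, Int32>>, with per-node buffers taken
# from the example above (assumption: the struct node itself contributes
# no buffers, matching that listing).
schema = Node(
    ["nulls top level list", "list offsets"],
    [
        Node([], [
            Node(
                ["struct field 0 nulls", "struct field 0 list offsets"],
                [Node(["struct field 0 inner UInt8 values"])],
            ),
            Node(["struct field 1 nulls", "struct field 1 Int32 values"]),
        ]),
    ],
)

buffer_order = list(preorder_buffers(schema))
for index, name in enumerate(buffer_order):
    print(f"{index}: {name}")
```

Because the order is fully determined by the schema, a reader that knows the schema can map buffer index i back to its role without any per-buffer labels in the header.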

The flatbuffer schema for the data header would then be:

namespace DataHeaders;

struct Buffer {
  data: long;
  length: int;
}

// Representing a single array (aka ValueVector), typically
table BufferList {
  // With FBS it is not possible to know the length of an array
  n_buffers: int;
  buffers: [Buffer];
}

// Multiple arrays -- could be used for long arrays or a
// whole table row batch
table ArrayBatch {
  n_arrays: int;
  arrays: [BufferList];
}
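
To make the intended use of these headers a bit more concrete, the following Python sketch resolves a list of (offset, length) records into zero-copy views over one memory region. It assumes `data` is a byte offset from the start of the region. The `Buffer` tuple and `resolve` function are illustrative names only, and no FlatBuffers decoding is shown.

```python
from typing import List, NamedTuple


class Buffer(NamedTuple):
    data: int    # assumed: byte offset from the start of the region
    length: int  # size of the buffer in bytes


def resolve(region: memoryview, buffers: List[Buffer]) -> List[memoryview]:
    """Return one zero-copy slice of `region` per header entry."""
    views = []
    for buf in buffers:
        end = buf.data + buf.length
        if buf.data < 0 or buf.length < 0 or end > len(region):
            raise ValueError(f"{buf} falls outside the region")
        views.append(region[buf.data:end])
    return views


# A 16-byte stand-in for a shared memory region or IPC payload.
region = memoryview(bytes(range(16)))
header = [Buffer(0, 4), Buffer(4, 8), Buffer(12, 4)]
views = resolve(region, header)
```

Slicing a `memoryview` does not copy the underlying bytes, which is the property a no-deserialization header format is meant to preserve.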


On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <wes@cloudera.com> wrote:

> hello all,
>
> I wanted to kick-start the process of coming up with a standardized /
> canonical metadata specification that we can use for describing Arrow
> data to be moved between systems. This breaks down into at least two
> distinct kinds of metadata
>
> 1) "Schemas": physical types, logical types, child array types, struct
> field names, and so forth. Does not contain information about the size
> of the actual physical data (which depends on the length of arrays and
> the sizes of list/variable-length type dimensions).
>
> 2) "Data headers": a description of the shape of a physical chunk of
> data associated with a particular schema. Array length, null count,
> memory buffer offsets and sizes, etc. This is the information you need
> to compute the right pointers into a shared memory region or IPC/RPC
> buffer and reconstruct Arrow container classes.
>
> Since #2 will depend on some of the details of #1, I suggest we start
> figuring out #1 first. As far as the type metadata is concerned, to
> avoid excess bike shedding we should break that problem into:
>
> A) The general layout of the type metadata / schemas
> B) The technology we use for representing the schemas (and data
> headers) in an implementation-independent way for use in an IPC/RPC
> setting (and even to "store" ephemeral data on disk)
>
> On Item B, from what I've seen with Parquet and such file formats with
> embedded metadata, and in the spirit of Arrow's "deserialize-nothing"
> ethos, I suggest we explore no-deserialization technologies like
> Google's Flatbuffers (https://github.com/google/flatbuffers) as a more
> CPU-efficient alternative to Thrift, Protobuf, or Avro. In large
> schemas, technologies like Thrift can result in significant overhead
> in "needle-in-haystack" problems where you are picking only a few
> columns out of very wide tables (> 1000s of columns), and it may be
> best to try to avoid this if at all possible.
>
> I would like some help stewarding the design process on this from the
> Arrow PMC and in particular those who have worked on the design and
> implementation of Parquet and other file formats and systems for which
> Arrow is an immediate intended companion. Lot of things we can learn
> from those past experiences.
>
> Thank you,
> Wes
>
