arrow-dev mailing list archives

From Wes McKinney <...@cloudera.com>
Subject Re: Cross-implementation metadata specification, IPC details
Date Tue, 01 Mar 2016 21:13:25 GMT
Inline responses

On Tue, Mar 1, 2016 at 8:24 AM, Jacques Nadeau <jacques@apache.org> wrote:
> Wes, thanks for starting this conversation.
>
> Couple thoughts:
>
> For metadata, we have two models existing (one in the ValueVectors approach
> and one in Parquet). It seems like we should start from one of those and
> then shape as appropriate. It seems like we have a richer physical
> capability than the core Dremel algorithm that Parquet implements, so I
> think it would make sense to focus first on the logical model and then
> figure out the shared physical that exists below that.
>
> While the Data Headers item (2) in your description may come logically
> second, I think that it greatly informs 1.B as I believe 2 is something
> that should be an in-memory canonical representation (similar to the
> vectors themselves). I know Steven has been looking at moving the Java
> layer over to serialize the data headers using something similar to this:
>
> Data headers use a deterministic pre-order "tree" ordering of the memory
> buffers (https://en.wikipedia.org/wiki/Tree_traversal). The data structures
> are simply an array of data headers consisting of a list of buffer offsets
> and sizes.
>

This makes sense, and is consistent with Parquet's depth-first
(pre-order) flattening of nested schemas into a flat
list<SchemaElement>
(https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L556).
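As an illustration of that pre-order flattening (the field names and helper below are made up for this sketch, not taken from either codebase), the traversal visits each node before its children:

```python
# Hypothetical sketch of pre-order (depth-first) schema flattening,
# analogous to Parquet's flat list<SchemaElement>.

class Field:
    def __init__(self, name, type_name, children=()):
        self.name = name
        self.type_name = type_name
        self.children = list(children)

def flatten_preorder(field):
    """Visit the field itself, then each child in order (pre-order)."""
    out = [field.name]
    for child in field.children:
        out.extend(flatten_preorder(child))
    return out

# List<Struct<String=List<UInt8>, Int32>>
schema = Field("top", "list", [
    Field("struct", "struct", [
        Field("f0", "list", [Field("f0.item", "uint8")]),
        Field("f1", "int32"),
    ]),
])

print(flatten_preorder(schema))
# -> ['top', 'struct', 'f0', 'f0.item', 'f1']
```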

> For example, consider this schema:
>
> List<Struct<String=List<UInt8>, Int32>>
>
> the pre-order buffer order is
>
> 0: nulls top level list
>
> 1: list offsets
>
> 2: struct field 0 nulls
>
> 3: struct field 0 list offsets
>
> 4: struct field 0 inner UInt8 values
>
> 5: struct field 1 nulls
>
> 6: struct field 1 Int32 values
>

Regarding the buffers, we would want to embed the null count for each
logical array in some way, so that the null bitmap can be omitted when
the count is zero. This also spares the receiver from having to compute
the actual null counts when receiving a row batch.

For example, for a List<T>, we would have two cases:

List
- null_count > 0 -> 2 buffers, 1 for the null bitmap, 1 for the list offsets
- null_count == 0 -> 1 buffer for the list offsets

Recursively, the child data for type T would similarly carry its own
null count, which determines its actual number of memory buffers. Let
me know what you think.
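To make the buffer-count rule concrete, here is a small illustrative sketch (the type names and function are invented for this example, not an Arrow API):

```python
# Illustrative sketch: the number of buffers for one node depends on its
# type and on whether the validity (null) bitmap can be omitted.

def buffer_count(type_name, null_count):
    """Count the buffers for a single node; the null bitmap is dropped
    entirely when null_count == 0."""
    validity = 0 if null_count == 0 else 1  # null bitmap buffer
    if type_name == "list":
        return validity + 1   # + list offsets (child buffers counted separately)
    if type_name == "struct":
        return validity       # children carry their own buffers
    return validity + 1       # + values buffer for primitive types

# List with some nulls: null bitmap + offsets -> 2 buffers
assert buffer_count("list", null_count=3) == 2
# List with no nulls: offsets only -> 1 buffer
assert buffer_count("list", null_count=0) == 1
```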

> The flatbuffer schema for the data header would then be:
>
> namespace DataHeaders;
>
> struct Buffer {
>
>  data: long;
>
>  length: int;
>
> }
>
> // Representing a single array (aka ValueVector), typically
>
> table BufferList {
>
>  // With FBS it is not possible to know the length of an array
>
>  n_buffers: int;
>
>  buffers: [Buffer];
>
> }
>
> // Multiple arrays -- could be used for long arrays or a
>
> // whole table row batch
>
> table ArrayBatch {
>
>  n_arrays: int;
>
>  arrays: [BufferList];
>
> }

I've been tinkering with flatbuffers, and array lengths are indeed
available in the generated code (see flatbuffers::Vector::size in the
C++ API, for example), so the n_buffers / n_arrays fields aren't
needed.

Per the discussion above re: null counts, something like this may work:

struct Buffer {
 data: long;
 length: int;
}

table ArrayNode {
  length: int;
  null_count: int;

  /// The number of buffers would be inferred from the node type and actual
  /// null_count
  buffers: [Buffer];
}

table RowBatch {
  /// Nodes correspond to the depth-first flattened logical schema
  nodes: [ArrayNode];
}

So in an example schema:

row batch schema {
  f0: List<List<UInt8>>,
  f1: Int32
}

The flattened version of this schema contains 4 ArrayNodes: 3 for the
nested f0 column and 1 for the flat f1 column.
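That node count falls out of the same depth-first traversal; a quick sketch (the tuple encoding here is illustrative only):

```python
# Sketch: counting flattened ArrayNodes for the example row batch schema.
# Each nested type contributes one node per level, depth-first.

def count_nodes(type_desc):
    """A type is (name, children); it contributes itself plus all
    descendants."""
    name, children = type_desc
    return 1 + sum(count_nodes(c) for c in children)

f0 = ("list", [("list", [("uint8", [])])])  # List<List<UInt8>> -> 3 nodes
f1 = ("int32", [])                          # Int32 -> 1 node

assert count_nodes(f0) + count_nodes(f1) == 4
```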

- Wes

>
>
> On Mon, Feb 29, 2016 at 6:13 PM, Wes McKinney <wes@cloudera.com> wrote:
>
>> hello all,
>>
>> I wanted to kick-start the process of coming up with a standardized /
>> canonical metadata specification that we can use for describing Arrow
>> data to be moved between systems. This breaks down into at least two
>> distinct kinds of metadata:
>>
>> 1) "Schemas": physical types, logical types, child array types, struct
>> field names, and so forth. Does not contain information about the size
>> of the actual physical data (which depends on the length of arrays and
>> the sizes of list/variable-length type dimensions).
>>
>> 2) "Data headers": a description of the shape of a physical chunk of
>> data associated with a particular schema. Array length, null count,
>> memory buffer offsets and sizes, etc. This is the information you need
>> to compute the right pointers into a shared memory region or IPC/RPC
>> buffer and reconstruct Arrow container classes.
>>
>> Since #2 will depend on some of the details of #1, I suggest we start
>> figuring out #1 first. As far as the type metadata is concerned, to
>> avoid excess bike shedding we should break that problem into:
>>
>> A) The general layout of the type metadata / schemas
>> B) The technology we use for representing the schemas (and data
>> headers) in an implementation-independent way for use in an IPC/RPC
>> setting (and even to "store" ephemeral data on disk)
>>
>> On Item B, from what I've seen with Parquet and such file formats with
>> embedded metadata, and in the spirit of Arrow's "deserialize-nothing"
>> ethos, I suggest we explore no-deserialization technologies like
>> Google's Flatbuffers (https://github.com/google/flatbuffers) as a more
>> CPU-efficient alternative to Thrift, Protobuf, or Avro. In large
>> schemas, technologies like Thrift can result in significant overhead
>> in "needle-in-haystack" problems where you are picking only a few
>> columns out of very wide tables (> 1000s of columns), and it may be
>> best to try to avoid this if at all possible.
>>
>> I would like some help stewarding the design process on this from the
>> Arrow PMC and in particular those who have worked on the design and
>> implementation of Parquet and other file formats and systems for which
>> Arrow is an immediate intended companion. There is a lot we can learn
>> from those past experiences.
>>
>> Thank you,
>> Wes
>>
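The "data headers" idea in the quoted message reduces, on the receiving side, to simple pointer arithmetic: given the base address of a shared memory region and a list of (offset, length) buffer entries, nothing needs to be deserialized. A minimal sketch (function and sample values are hypothetical):

```python
# Sketch: reconstructing buffer pointers from a data header. The header
# carries per-buffer (offset, length) pairs relative to a shared memory
# region's base address.

def resolve_buffers(base_address, header):
    """Turn relative (offset, length) entries into absolute
    (address, length) pairs."""
    return [(base_address + offset, length) for offset, length in header]

# e.g. null bitmap, list offsets, values
header = [(0, 64), (64, 4096), (4160, 1024)]
ptrs = resolve_buffers(0x7f00_0000_0000, header)
assert ptrs[1] == (0x7f00_0000_0000 + 64, 4096)
```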
