arrow-dev mailing list archives

From Antoine Pitrou <anto...@python.org>
Subject Re: [DISCUSS] C-level in-process array protocol
Date Tue, 01 Oct 2019 09:19:17 GMT

On 01/10/2019 at 00:39, Wes McKinney wrote:
> A couple things:
> 
> * I think a C protocol / FFI for Arrow array/vectors would be better
> to have the same "shape" as an assembled array. Note that the C
> structs here have very nearly the same "shape" as the data structure
> representing a C++ Array object [1]. The disassembly and reassembly
> here is substantially simpler than the IPC protocol. A recursive
> structure in Flatbuffers would make RecordBatch messages much larger,
> so the flattened / disassembled representation we use for serialized
> record batches is the correct one

I'm not sure I agree:

- indeed, it's not a coincidence that the ArrowArray struct closely
resembles the C++ ArrayData object :-)  We have good experience with
that abstraction and it has proven to work quite well

- the IPC format is meant for serialization while the C data protocol is
meant for in-memory communication, so different concerns apply

- the fact that this makes the layout slightly larger doesn't seem
important at all; we're not talking about transferring data over the wire

There's also another argument for having a recursive struct: it
simplifies how the data type is represented, since we can encode each
child type individually instead of encoding it in the parent's format
string (same applies for metadata and individual flags).
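To illustrate the point about per-child type encoding, here is a minimal
sketch. It assumes a hypothetical `SketchArray` struct that is not the
definition from the proposal, and the format strings "+l" (list) and "i"
(int32) are placeholders chosen for the example. Each node carries only its
own type, and the full nested type falls out of walking the children.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical recursive struct, for illustration only. */
struct SketchArray {
  const char* format;             /* type of this node only (assumed encoding) */
  int64_t n_children;
  struct SketchArray** children;  /* each child describes its own type */
};

/* Append s at out[pos] if it fits (leaving room for the NUL terminator).
 * Text that does not fit is dropped. Returns the new position. */
static size_t sketch_append(char* out, size_t cap, size_t pos, const char* s) {
  size_t n = strlen(s);
  if (pos + n < cap) {
    memcpy(out + pos, s, n);
    pos += n;
    out[pos] = '\0';
  }
  return pos;
}

/* Render the nested type by recursing into the children, e.g. a list node
 * with one int32 child renders as "+l<i>". The parent never has to parse
 * or embed the child's type. */
static size_t sketch_describe(const struct SketchArray* node, char* out,
                              size_t cap, size_t pos) {
  assert(node != NULL);
  pos = sketch_append(out, cap, pos, node->format);
  if (node->n_children > 0) {
    pos = sketch_append(out, cap, pos, "<");
    for (int64_t i = 0; i < node->n_children; ++i) {
      if (i > 0) pos = sketch_append(out, cap, pos, ",");
      pos = sketch_describe(node->children[i], out, cap, pos);
    }
    pos = sketch_append(out, cap, pos, ">");
  }
  return pos;
}
```

With a flattened representation, the parent's format string would instead
have to spell out the whole nested type, and every consumer would need a
parser for it.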

> * The "formal" C protocol having the "assembled" shape means that many
> minimal Arrow users won't have to implement any separate data
> structures. They can just use the C struct directly or a slightly
> wrapped version thereof with some convenience functions.

Yes, but the same applies to the current proposal.

> * I think that requiring building a Flatbuffer for minimal use cases
> (e.g. communicating simple record batches with primitive types) passes
> on implementation burden to minimal users.

It certainly does.

> I think the mantra of the C protocol should be the following:
> 
> * Users of the protocol have to write little to no code to use it. For
> example, populating an INT32 array should require only a few lines of
> code

Agreed.  As a sidenote, the spec should have an example of doing this in
raw C.
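
As a rough idea of what such an example could look like, here is a sketch
that fills an int32 array. The `ExampleArray` struct, its field names, the
"i" format string and the buffer order (validity bitmap first, values
second) are all assumptions made for illustration. They are not the
definitions from the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct, for illustration only. */
struct ExampleArray {
  const char* format;    /* assumed type string; "i" stands for int32 here */
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  const void** buffers;  /* assumed: [0] validity bitmap, [1] values */
  int64_t n_children;
  struct ExampleArray** children;
  void (*release)(struct ExampleArray*);  /* frees what the producer owns */
};

/* Free the buffers allocated by example_make_int32 and mark the struct
 * as released by clearing the callback. */
static void example_release(struct ExampleArray* array) {
  if (array->buffers != NULL) {
    free((void*)array->buffers[1]);
    free((void*)array->buffers);
  }
  array->buffers = NULL;
  array->release = NULL;
}

/* Copy `length` values into a newly allocated int32 array without nulls.
 * Returns 0 on success, -1 if an allocation fails. */
static int example_make_int32(const int32_t* values, int64_t length,
                              struct ExampleArray* out) {
  assert(length >= 0);
  size_t n = (size_t)length;
  const void** buffers = malloc(2 * sizeof *buffers);
  int32_t* data = malloc((n > 0 ? n : 1) * sizeof *data);
  if (buffers == NULL || data == NULL) {
    free(buffers);
    free(data);
    return -1;
  }
  if (n > 0) memcpy(data, values, n * sizeof *data);
  buffers[0] = NULL;  /* no validity bitmap, since there are no nulls */
  buffers[1] = data;

  out->format = "i";
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->n_children = 0;
  out->children = NULL;
  out->release = example_release;
  return 0;
}
```

A consumer would read the values through `buffers[1]` and call
`array.release(&array)` once it is done with them.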

Regards

Antoine.
