arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <wesmck...@gmail.com>
Subject Re: [DISCUSS] C-level in-process array protocol
Date Tue, 01 Oct 2019 12:42:07 GMT
hi Antoine,

On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <antoine@python.org> wrote:
>
>
> Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > A couple things:
> >
> > * I think a C protocol / FFI for Arrow array/vectors would be better
> > to have the same "shape" as an assembled array. Note that the C
> > structs here have very nearly the same "shape" as the data structure
> > representing a C++ Array object [1]. The disassembly and reassembly
> > here is substantially simpler than the IPC protocol. A recursive
> > structure in Flatbuffers would make RecordBatch messages much larger,
> > so the flattened / disassembled representation we use for serialized
> > record batches is the correct one
>
> I'm not sure I agree:
>
> - indeed, it's not a coincidence that the ArrowArray struct looks quite
> closely like the C++ ArrayData object :-)  We have good experience with
> that abstraction and it has proven to work quite well
>
> - the IPC format is meant for serialization while the C data protocol is
> meants for in-memory communication, so different concerns apply
>
> - the fact that this makes the layout slightly larger doesn't seem
> important at all; we're not talking about transferring data over the wire
>
> There's also another argument for having a recursive struct: it
> simplifies how the data type is represented, since we can encode each
> child type individually instead of encoding it in the parent's format
> string (same applies for metadata and individual flags).
>

I was saying something different here. I was making an argument about
why we use the flattened array-of-structs in the IPC protocol. One
reason is that it's a more compact representation. That is not very
important here because this protocol is only for *in-process* (for
languages that have a C FFI facility) rather than *inter-process*
communication.

I agree also that the type encoding is simple, here, too, since we
aren't having to split the schema and record batch between different
serialized messages. There is some potential waste with having to
populate the type fields multiple times when communicating a sequence
of "chunks" from the same logical dataset.

> > * The "formal" C protocol having the "assembled" shape means that many
> > minimal Arrow users won't have to implement any separate data
> > structures. They can just use the C struct directly or a slightly
> > wrapped version thereof with some convenience functions.
>
> Yes, but the same applies to the current proposal.
>
> > * I think that requiring building a Flatbuffer for minimal use cases
> > (e.g. communicating simple record batches with primitive types) passes
> > on implementation burden to minimal users.
>
> It certainly does.
>
> > I think the mantra of the C protocol should be the following:
> >
> > * Users of the protocol have to write little to no code to use it. For
> > example, populating an INT32 array should require only a few lines of
> > code
>
> Agreed.  As a sidenote, the spec should have an example of doing this in
> raw C.
>
> Regards
>
> Antoine.

Mime
View raw message