arrow-dev mailing list archives

From Jacques Nadeau <jacq...@apache.org>
Subject Re: [DISCUSS] C-level in-process array protocol
Date Thu, 03 Oct 2019 00:46:06 GMT
I'd like to hear more opinions from others on this topic. This conversation
seems mostly dominated by comments from myself, Wes and Antoine.

I think it is reasonable to argue that keeping any ABI (or header/struct
pattern) as narrow as possible would allow us to minimize overlap with the
existing in-memory specification. In Arrow's case, this could be as simple
as a single memory pointer for schema (backed by flatbuffers) and a single
memory location for data (that references the record batch header, which in
turn provides pointers into the actual arrow data). Extensions would need
to be added for reference management as done here but I continue to think
we should defer discussion of that until the base data structures are
resolved. I see the comments here as arguing for a much broader ABI, in
part to support having people build "Arrow" components that interconnect
using this new interface. I understand that the desire to expand the ABI is
driven by the need to reduce dependencies and ease usability.
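
To make the "narrow" option concrete, here is a minimal sketch in C. It
assumes a hypothetical handle type (`NarrowArrowHandle`) and helper
(`narrow_handle_make`); neither name nor the explicit size fields come from
the patch or from any Arrow library. The bytes it points at are assumed to
already be in the existing formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical narrow handle: two pointers into bytes that already follow
 * the existing Arrow formats. None of these names come from the patch. */
typedef struct {
    const uint8_t *schema;        /* Flatbuffers-encoded schema */
    size_t schema_size;           /* assumption: sizes travel with pointers */
    const uint8_t *record_batch;  /* record batch header, which in turn
                                     points into the actual Arrow data */
    size_t record_batch_size;
} NarrowArrowHandle;

/* Wrap existing buffers without copying or re-encoding anything. */
static NarrowArrowHandle narrow_handle_make(const uint8_t *schema,
                                            size_t schema_size,
                                            const uint8_t *record_batch,
                                            size_t record_batch_size) {
    assert(schema != NULL && record_batch != NULL);
    NarrowArrowHandle h = { schema, schema_size,
                            record_batch, record_batch_size };
    return h;
}
```

With something of this shape, a consumer in any language would keep using
its existing Flatbuffers schema reader, and reference management would be
layered on separately.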

The representation within the related patch is being presented as a way for
applications to share Arrow data but is not easily accessible to all
languages. I want to avoid a situation where someone says "I produced an
Arrow API" when what they've really done is created a C interface which
only a small subset of languages can actually leverage. For example, every
language now knows how to parse the existing schema definition as rendered
in flatbuf. In order to interact with something that implements this new
pattern one would also be required to implement completely new schema
consumption code. The proposal itself suggests this (for example,
enhancing the C++ library to consume structures produced this way).

As I said, I really want to hear more opinions. Running this past various
developers I know, many have echoed my concerns but that really doesn't
matter (and who knows how much of that is colored by my presentation of the
issue). What do people here think? If someone builds an "Arrow" library
that implements this set of structures, how does one use it in Node? In
Java? Does it drive creation of a secondary set of interfaces in each of
those languages to work with this kind of pattern? (For example, in a JVM
view of the world, working with a plain struct in java rather than a set of
memory pointers against our existing IPC formats would be quite painful and
we'd definitely need to create some glue code for users. I worry the same
pattern would occur in many other languages.)

To respond directly to some of Wes's most recent comments from the email
below: I struggle to map your description of the situation to the rest of
the thread and the proposed patch. For example, you say that a non-goal is
"creating a new canonical way to serialize metadata" but the patch
proposes a concrete string-based encoding system to describe data types.
Aren't those things in conflict?

I'll also think more on this and challenge my own perspective. This isn't
where my focus is so my comments aren't as developed/thoughtful as I'd like.


On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Jacques,
>
> I think we've veered off course a bit and maybe we could reframe the
> discussion.
>
> Goals
> * A "drop-in" header-only C file that projects can use as a
> programming interface either internally only or to expose in-memory
> data structures between C functions at call sites. Ideally little to
> no disassembly/reassembly should be required on either "side" of the
> call site.
> * Simplifying adoption of Arrow for C programmers, or languages based
> around C FFI
>
> Non-goals
> * Expanding the columnar format or creating an alternative canonical
> in-memory representation
> * Creating a new canonical way to serialize metadata
>
> Note that this use case has been on my mind for more than 2 years:
> https://issues.apache.org/jira/browse/ARROW-1058
>
> I think there are a couple of potentially misleading things at play here
>
> 1. The use of the word "protocol". In C, a struct has a well-defined
> binary layout, so a C API is also an ABI. Using C structs to
> communicate data can be considered to be a protocol, but it means
> something different in the context of the "Arrow protocol". I think we
> need to call this a "C API"
>
> 2. The documentation for this in Antoine's PR is in the format/
> directory. It would probably be better to have a "C API" section in
> the documentation.
>
> The header file under discussion and the documentation about it is
> best considered as a "library".
>
> It might be useful at some point to create a C99 implementation of the
> IPC protocol as well using FlatCC with the goal of having a complete
> implementation of the columnar format in C with minimal binary
> footprint. This is analogous to the NanoPB project which is an
> implementation of Protocol Buffers with small code size
>
> https://github.com/nanopb/nanopb
>
> Let me know if this makes more sense.
>
> I think it's important to communicate clearly about this primarily for
> the benefit of the outside world which can confuse easily as we have
> observed over the last few years =)
>
> Wes
>
> On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <jacques@apache.org> wrote:
> >
> > I disagree with this statement:
> >
> > - the IPC format is meant for serialization while the C data protocol is
> > meant for in-memory communication, so different concerns apply
> >
> > If that is how a particular implementation presents it, that is a
> > weakness of the implementation, not the format. The primary use case I
> > was focused on when working on the initial format was communication
> > within the same process. It seems like this is being used as a basis for
> > the introduction of new things when the premise is inconsistent with the
> > intention of the creation. The specific reason we used flatbuffers in the
> > project was to collapse the separation of in-process and out-of-process
> > communication. It means the same thing it does with the Arrow data
> > itself: that a consumer doesn't have to use a particular library to
> > interact with and use the data.
> >
> > It seems like there are two ideas here:
> >
> > 1) How do we make it easier for people to use Arrow?
> > 2) Should we implement a new in-memory representation of Arrow that is
> > language specific?
> >
> > I'm entirely in support of number one. If for a particular type of
> > domain, people want an easier way to interact with Arrow, let's make a
> > new library that helps with that. In each of our current libraries, we do
> > many things to make it easier to work with Arrow. None of those require a
> > change to the core format or are formalized as a new in-memory standard.
> > The in-memory representations of Rust or JavaScript or Java objects are
> > implementation details.
> >
> > I'm against number two as it creates a fragmentation problem. Arrow is
> > about having a single canonical format for memory for both metadata and
> > data. Having multiple in-memory formats (especially when some are not
> > language independent) is counter to the goals of the project.
>
> I don't think anyone is proposing anything that would cause fragmentation.
>
> A central question is whether it is useful to define a reusable C ABI
> for the Arrow columnar format, and if there is sufficient interest, a
> tiny C implementation of the IPC protocol (which uses the Flatbuffers
> message) that assembles and disassembles the data structures defined
> in the C ABI.
>
> We could separately create a tiny implementation of the Arrow IPC
> protocol using FlatCC that could be dropped into applications
> requiring only a C compiler and nothing else.
>
>
> >
> > Two other, separate comments:
> > 1) I don't understand the idea that we need to change the way Arrow
> > fundamentally works so that people can avoid using a dependency. If the
> > dependency is small, open source and easy to build, people can fork it
> > and include it directly if they want to. Let's not violate project
> > principles because DuckDB has a religious perspective on dependencies. If
> > the problem is people have to swallow too large of a pill to do basic
> > things with Arrow in C, let's focus on fixing that (to our definition of
> > ease, not someone else's). If FlatCC solves some of those things, great.
> > If we need to build a baby integration library that is more C centric,
> > great. Neither of those things requires implementing something at the
> > format level.
> >
> > 2) It seems like we should discuss the data structure problem separately
> > from the reference management concern.
> >
> >
> > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> >
> > > hi Antoine,
> > >
> > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <antoine@python.org> wrote:
> > > >
> > > >
> > > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> > > > > A couple things:
> > > > >
> > > > > * I think a C protocol / FFI for Arrow array/vectors would be
> > > > > better to have the same "shape" as an assembled array. Note that
> > > > > the C structs here have very nearly the same "shape" as the data
> > > > > structure representing a C++ Array object [1]. The disassembly and
> > > > > reassembly here is substantially simpler than the IPC protocol. A
> > > > > recursive structure in Flatbuffers would make RecordBatch messages
> > > > > much larger, so the flattened / disassembled representation we use
> > > > > for serialized record batches is the correct one
> > > >
> > > > I'm not sure I agree:
> > > >
> > > > - indeed, it's not a coincidence that the ArrowArray struct looks
> > > > quite closely like the C++ ArrayData object :-)  We have good
> > > > experience with that abstraction and it has proven to work quite well
> > > >
> > > > - the IPC format is meant for serialization while the C data
> > > > protocol is meant for in-memory communication, so different concerns
> > > > apply
> > > >
> > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > important at all; we're not talking about transferring data over the
> > > > wire
> > > >
> > > > There's also another argument for having a recursive struct: it
> > > > simplifies how the data type is represented, since we can encode each
> > > > child type individually instead of encoding it in the parent's format
> > > > string (same applies for metadata and individual flags).
> > > >
> > >
> > > I was saying something different here. I was making an argument about
> > > why we use the flattened array-of-structs in the IPC protocol. One
> > > reason is that it's a more compact representation. That is not very
> > > important here because this protocol is only for *in-process* (for
> > > languages that have a C FFI facility) rather than *inter-process*
> > > communication.
> > >
> > > I agree also that the type encoding is simple, here, too, since we
> > > aren't having to split the schema and record batch between different
> > > serialized messages. There is some potential waste with having to
> > > populate the type fields multiple times when communicating a sequence
> > > of "chunks" from the same logical dataset.
> > >
> > > > > * The "formal" C protocol having the "assembled" shape means that
> > > > > many minimal Arrow users won't have to implement any separate data
> > > > > structures. They can just use the C struct directly or a slightly
> > > > > wrapped version thereof with some convenience functions.
> > > >
> > > > Yes, but the same applies to the current proposal.
> > > >
> > > > > * I think that requiring building a Flatbuffer for minimal use
> > > > > cases (e.g. communicating simple record batches with primitive
> > > > > types) passes on implementation burden to minimal users.
> > > >
> > > > It certainly does.
> > > >
> > > > > I think the mantra of the C protocol should be the following:
> > > > >
> > > > > * Users of the protocol have to write little to no code to use it.
> > > > > For example, populating an INT32 array should require only a few
> > > > > lines of code
> > > >
> > > > Agreed.  As a sidenote, the spec should have an example of doing
> > > > this in raw C.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > >
>
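
The quoted thread asks what populating an INT32 array "in raw C" might look
like, and argues that a recursive struct lets each child carry its own type
description. Below is a minimal sketch of that idea. The struct
(`SketchArray`), its field names, the helper (`sketch_int32_init`) and the
"int32" type tag are all hypothetical and are not taken from the patch. The
one thing borrowed from the columnar format is the buffer order for a
primitive array: validity bitmap first, then values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recursive array struct; names are illustrative only and do
 * not reproduce the struct in the patch under discussion. */
typedef struct SketchArray {
    const char *format;             /* type of this node only (made-up tag) */
    int64_t length;                 /* number of logical values */
    int64_t null_count;
    int64_t n_buffers;
    const void **buffers;           /* [0] validity bitmap, [1] values */
    int64_t n_children;
    struct SketchArray **children;  /* each child carries its own format */
} SketchArray;

/* Describe an existing int32_t buffer with no nulls, without copying it.
 * `slots` must have room for two pointers and must outlive `out`. */
static void sketch_int32_init(SketchArray *out, const int32_t *values,
                              int64_t length, const void **slots) {
    assert(out != NULL && values != NULL && slots != NULL && length >= 0);
    slots[0] = NULL;    /* no validity bitmap needed when null_count is 0 */
    slots[1] = values;  /* values buffer is borrowed, not owned */
    out->format = "int32";
    out->length = length;
    out->null_count = 0;
    out->n_buffers = 2;
    out->buffers = slots;
    out->n_children = 0;
    out->children = NULL;
}
```

A caller would declare `const void *slots[2];` and a `SketchArray` next to
its data and call `sketch_int32_init(&arr, values, n, slots)`. Ownership and
release of the buffers are deliberately left out, in line with the suggestion
above to discuss reference management separately.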
