arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [DISCUSS] C-level in-process array protocol
Date Tue, 08 Oct 2019 20:36:42 GMT
On Tue, Oct 8, 2019 at 3:34 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>
> hi Jacques,
>
> On Tue, Oct 8, 2019 at 1:54 PM Jacques Nadeau <jacques@apache.org> wrote:
> >
> > I am removing all my objections to this work.
> >
> > I wish there was more feedback from additional community members. I continue
> > to be concerned about fragmentation. I don't agree with the arguments here
> > that we need to add a new API to make it easy for people to *not* use the
> > Arrow codebase. It seems like a punt on building useful libraries within the
> > project that will ultimately hurt the interoperability story.
> >
>
> I think we'll have to take a "wait and see" approach. I believe the
> community needs to build accessible libraries that offer value to
> third party users, and we will continue to do that. I think there are
> use cases here that fall outside of which library to use, but time
> will tell.
>
> > As a side note, it seems like much of this is about people's distaste for
> > flatbuffers. I know I regret using it. If we had a chance to do it over
> > again, I would have chosen to use protobuf for everything except the data
> > header, where I would hand write the encoding (since it is so simple anyway).
> > If it is such a problem that people are contorting to work around it, maybe
> > we should address that? Just a thought.
> >
>
> I think that using a Protobuf-like format with an IDL and a compiler presents a problem.

To clarify some inarticulate language, since people reading may misinterpret it:

Using an IDL-based metadata representation _in this C API_ presents a
potential roadblock for users.

For a canonical metadata representation with backward and forward
compatibility guarantees, it would be ill-advised not to use
Protobuf/Flatbuffers/Thrift.

> Note that Flatbuffers is much better for C/C++ programmers and I still
> think it was the right choice for the project. With Protobuf, unlike
> Flatbuffers, C/C++ applications must link either libprotobuf.so or
> libprotobuf.a. Flatbuffers in C++ is a header-only dependency that is
> trivial to bundle with a project [1]. The same is true for Thrift, and this
> came up in the TF discussion [2].
>
> [1]: https://github.com/apache/arrow/tree/master/cpp/thirdparty/flatbuffers/include/flatbuffers
> [2]: https://github.com/tensorflow/community/pull/162#discussion_r332610486
>
> > Thanks for the discourse and patience.
> >
> > On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield <emkornfield@gmail.com> wrote:
> >>
> >> Hi Wes,
> >> I agree for third-parties "A" (Field data structures) is the most useful.
> >>
> >> At least in my mind the discussion was for both first and third-parties.  I
> >> was trying to point out that "A" is less necessary as a first step for
> >> first-party integrations and could potentially require more effort if we
> >> already have the code that does "B" (field reassembly).
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> >>
> >> > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <emkornfield@gmail.com>
> >> > wrote:
> >> > >
> >> > > I've tried to summarize my understanding of the debate so far and give
> >> > > some initial thoughts. I think there are two potentially different sets
> >> > > of users that we are targeting with a stable C API/ABI: ourselves and
> >> > > external parties.
> >> > >
> >> > > 1.  Different language implementations within the Arrow project that
> >> > > want to call into each other's code.  We still don't have a great story
> >> > > around this in terms of reusable libraries, and questions like [1] are
> >> > > motivating examples for making something better in this context.
> >> > > 2.  Third parties wishing to support/integrate with Arrow.  Some
> >> > > conjectures about these users:
> >> > >   - Users in this group are NOT necessarily familiar with existing
> >> > > technologies Arrow uses (i.e. flatbuffers)
> >> > >   - The stability of the API is the primary concern (consumers don't
> >> > > want to change when a new version of the library ships)
> >> > >   - An important secondary concern is additional libraries that need to
> >> > > be integrated in addition to the API
> >> > >
> >> > > The main debate points seem to be:
> >> > >
> >> > > 1.  Vector/Array oriented API vs existing Record Batch.  Will an
> >> > > additional column oriented API become too much of a maintenance
> >> > > headache/cause fragmentation?
> >> > >
> >> > >  - In my mind the question here is which set of users we are
> >> > > prioritizing. IMO the combination of flatbuffers and translation to/from
> >> > > RecordBatch format offers too much friction to make it easy for a
> >> > > third-party implementer to use. If we are prioritizing for our own
> >> > > internal use-cases I think we should try out a RecordBatch+Flatbuffers
> >> > > based C-API. We already have all the necessary building blocks.
> >> > >
> >> >
> >> > If a C function passes you a string containing a RecordBatch
> >> > Flatbuffers message, what happens next? This message has to be
> >> > reassembled into a recursive data structure before you can "do"
> >> > anything with it. Are we expecting every third party project to
> >> > implement:
> >> >
> >> > A. Data structures appropriate to represent a logical "field" in a
> >> > record batch (which have to be recursive to account for nested types'
> >> > children)
> >> > B. The logic to convert from the flattened Flatbuffers representation
> >> > to some implementation of A
> >> >
> >> > I'm arguing that we should provide both to third parties. To build B,
> >> > you need A. Some consumers will only use A. This discussion is
> >> > essentially about developing an ultraminimalist "drop-in" C
> >> > implementation of A.
> >> >
> >> > > 2.  How onerous is the dependency on flatbuffers, both from a learning
> >> > > curve perspective and as a dependency for third-party integrators?
> >> > > - Flatbuffers aren't entirely straightforward, and I think if we do move
> >> > > forward with an API based on Column/Array we should consider
> >> > > alternatives as long as the necessary parsing code can be done in a
> >> > > small amount of code (I'm personally against JSON for this, but can see
> >> > > the arguments for it).
> >> > >
> >> > > 3.  Do all existing library implementations need to support both
> >> > > Column/Array a ABI?  How will compliance be checked for the new API/ABI?
> >> > >
> >> > > - I'm still thinking this through.
> >> > >
> >> > > [1]
> >> > >
> >> > https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
> >> > >
> >> > > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <jacques@apache.org>
> >> > wrote:
> >> > >
> >> > > > I'd like to hear more opinions from others on this topic. This
> >> > > > conversation seems mostly dominated by comments from myself, Wes and
> >> > > > Antoine.
> >> > > >
> >> > > > I think it is reasonable to argue that keeping any ABI (or
> >> > > > header/struct pattern) as narrow as possible would allow us to minimize
> >> > > > overlap with the existing in-memory specification. In Arrow's case, this
> >> > > > could be as simple as a single memory pointer for schema (backed by
> >> > > > flatbuffers) and a single memory location for data (that references the
> >> > > > record batch header, which in turn provides pointers into the actual
> >> > > > arrow data). Extensions would need to be added for reference management
> >> > > > as done here but I continue to think we should defer discussion of that
> >> > > > until the base data structures are resolved. I see the comments here as
> >> > > > arguing for a much broader ABI, in part to support having people build
> >> > > > "Arrow" components that interconnect using this new interface. I
> >> > > > understand the desire to expand the ABI to be driven by needs to reduce
> >> > > > dependencies and ease usability.
> >> > > >
> >> > > > The representation within the related patch is being presented as a way
> >> > > > for applications to share Arrow data but is not easily accessible to all
> >> > > > languages. I want to avoid a situation where someone says "I produced an
> >> > > > Arrow API" when what they've really done is created a C interface which
> >> > > > only a small subset of languages can actually leverage. For example,
> >> > > > every language now knows how to parse the existing schema definition as
> >> > > > rendered in flatbuf. In order to interact with something that implements
> >> > > > this new pattern one would also be required to implement completely new
> >> > > > schema consumption code. In the proposal itself it suggests this (for
> >> > > > example enhancing the C++ library to consume structures produced this
> >> > > > way).
> >> > > >
> >> > > > As I said, I really want to hear more opinions. Running this past
> >> > > > various developers I know, many have echoed my concerns but that really
> >> > > > doesn't matter (and who knows how much of that is colored by my
> >> > > > presentation of the issue). What do people here think? If someone builds
> >> > > > an "Arrow" library that implements this set of structures, how does one
> >> > > > use it in Node? In Java? Does it drive creation of a secondary set of
> >> > > > interfaces in each of those languages to work with this kind of pattern?
> >> > > > (For example, in a JVM view of the world, working with a plain struct in
> >> > > > java rather than a set of memory pointers against our existing IPC
> >> > > > formats would be quite painful and we'd definitely need to create some
> >> > > > glue code for users. I worry the same pattern would occur in many other
> >> > > > languages.)
> >> > > >
> >> > > > To respond directly to some of Wes's most recent comments from the email
> >> > > > below: I struggle to map your description of the situation to the rest
> >> > > > of the thread and the proposed patch.  For example, you say that a
> >> > > > non-goal is "creating a new canonical way to serialize metadata" but the
> >> > > > patch proposes a concrete string based encoding system to describe data
> >> > > > types. Aren't those things in conflict?
> >> > > >
> >> > > > I'll also think more on this and challenge my own perspective. This
> >> > > > isn't where my focus is so my comments aren't as developed/thoughtful as
> >> > > > I'd like.
> >> > > >
> >> > > >
> >> > > > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <wesmckinn@gmail.com>
> >> > wrote:
> >> > > >
> >> > > > > hi Jacques,
> >> > > > >
> >> > > > > I think we've veered off course a bit and maybe we could reframe the
> >> > > > > discussion.
> >> > > > >
> >> > > > > Goals
> >> > > > > * A "drop-in" header-only C file that projects can use as a
> >> > > > > programming interface either internally only or to expose in-memory
> >> > > > > data structures between C functions at call sites. Ideally little to
> >> > > > > no disassembly/reassembly should be required on either "side" of the
> >> > > > > call site.
> >> > > > > * Simplifying adoption of Arrow for C programmers, or languages based
> >> > > > > around C FFI
> >> > > > >
> >> > > > > Non-goals
> >> > > > > * Expanding the columnar format or creating an alternative canonical
> >> > > > > in-memory representation
> >> > > > > * Creating a new canonical way to serialize metadata
> >> > > > >
> >> > > > > Note that this use case has been on my mind for more than 2 years:
> >> > > > > https://issues.apache.org/jira/browse/ARROW-1058
> >> > > > >
> >> > > > > I think there are a couple of potentially misleading things at play
> >> > > > > here:
> >> > > > >
> >> > > > > 1. The use of the word "protocol". In C, a struct has a well-defined
> >> > > > > binary layout, so a C API is also an ABI. Using C structs to
> >> > > > > communicate data can be considered to be a protocol, but it means
> >> > > > > something different in the context of the "Arrow protocol". I think
> >> > > > > we need to call this a "C API"
> >> > > > >
> >> > > > > 2. The documentation for this in Antoine's PR is in the format/
> >> > > > > directory. It would probably be better to have a "C API" section in
> >> > > > > the documentation.
> >> > > > >
> >> > > > > The header file under discussion and the documentation about it are
> >> > > > > best considered as a "library".
> >> > > > >
> >> > > > > It might be useful at some point to create a C99 implementation of
> >> > > > > the IPC protocol as well using FlatCC with the goal of having a
> >> > > > > complete implementation of the columnar format in C with minimal
> >> > > > > binary footprint. This is analogous to the NanoPB project which is an
> >> > > > > implementation of Protocol Buffers with small code size
> >> > > > >
> >> > > > > https://github.com/nanopb/nanopb
> >> > > > >
> >> > > > > Let me know if this makes more sense.
> >> > > > >
> >> > > > > I think it's important to communicate clearly about this primarily
> >> > > > > for the benefit of the outside world which can confuse easily as we
> >> > > > > have observed over the last few years =)
> >> > > > >
> >> > > > > Wes
> >> > > > >
> >> > > > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <jacques@apache.org>
> >> > > > wrote:
> >> > > > > >
> >> > > > > > I disagree with this statement:
> >> > > > > >
> >> > > > > > - the IPC format is meant for serialization while the C data
> >> > > > > > protocol is meant for in-memory communication, so different concerns
> >> > > > > > apply
> >> > > > > >
> >> > > > > > If that is how a particular implementation presents it, that is a
> >> > > > > > weakness of the implementation, not the format. The primary use case
> >> > > > > > I was focused on when working on the initial format was
> >> > > > > > communication within the same process. It seems like this is being
> >> > > > > > used as a basis for the introduction of new things when the premise
> >> > > > > > is inconsistent with the intention of the creation. The specific
> >> > > > > > reason we used flatbuffers in the project was to collapse the
> >> > > > > > separation of in-process and out-of-process communication. It means
> >> > > > > > the same thing it does with the Arrow data itself: that a consumer
> >> > > > > > doesn't have to use a particular library to interact with and use
> >> > > > > > the data.
> >> > > > > >
> >> > > > > > It seems like there are two ideas here:
> >> > > > > >
> >> > > > > > 1) How do we make it easier for people to use Arrow?
> >> > > > > > 2) Should we implement a new in-memory representation of Arrow that
> >> > > > > > is language specific?
> >> > > > > >
> >> > > > > > I'm entirely in support of number one. If for a particular type of
> >> > > > > > domain, people want an easier way to interact with Arrow, let's make
> >> > > > > > a new library that helps with that. In each of our current
> >> > > > > > libraries, we do many things to make it easier to work with Arrow.
> >> > > > > > None of those require a change to the core format or are formalized
> >> > > > > > as a new in-memory standard. The in-memory representation of rust or
> >> > > > > > javascript or java objects are implementation details.
> >> > > > > >
> >> > > > > > I'm against number two as it creates a fragmentation problem. Arrow
> >> > > > > > is about having a single canonical format for memory for both
> >> > > > > > metadata and data. Having multiple in-memory formats (especially
> >> > > > > > when some are not language independent) is counter to the goals of
> >> > > > > > the project.
> >> > > > >
> >> > > > > I don't think anyone is proposing anything that would cause
> >> > > > > fragmentation.
> >> > > > >
> >> > > > > A central question is whether it is useful to define a reusable C ABI
> >> > > > > for the Arrow columnar format, and if there is sufficient interest, a
> >> > > > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> >> > > > > message) that assembles and disassembles the data structures defined
> >> > > > > in the C ABI.
> >> > > > >
> >> > > > > We could separately create a tiny implementation of the Arrow IPC
> >> > > > > protocol using FlatCC that could be dropped into applications
> >> > > > > requiring only a C compiler and nothing else.
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > Two other, separate comments:
> >> > > > > > 1) I don't understand the idea that we need to change the way Arrow
> >> > > > > > fundamentally works so that people can avoid using a dependency. If
> >> > > > > > the dependency is small, open source and easy to build, people can
> >> > > > > > fork it and include directly if they want to. Let's not violate
> >> > > > > > project principles because DuckDB has a religious perspective on
> >> > > > > > dependencies. If the problem is people have to swallow too large of
> >> > > > > > a pill to do basic things with Arrow in C, let's focus on fixing
> >> > > > > > that (to our definition of ease, not someone else's). If FlatCC
> >> > > > > > solves some of those things, great. If we need to build a baby
> >> > > > > > integration library that is more C centric, great. Neither of those
> >> > > > > > things require implementing something at the format level.
> >> > > > > >
> >> > > > > > 2) It seems like we should discuss the data structure problem
> >> > > > > > separately from the reference management concern.
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmckinn@gmail.com>
> >> > > > wrote:
> >> > > > > >
> >> > > > > > > hi Antoine,
> >> > > > > > >
> >> > > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <antoine@python.org>
> >> > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> >> > > > > > > > > A couple things:
> >> > > > > > > > >
> >> > > > > > > > > * I think a C protocol / FFI for Arrow array/vectors would be
> >> > > > > > > > > better to have the same "shape" as an assembled array. Note
> >> > > > > > > > > that the C structs here have very nearly the same "shape" as
> >> > > > > > > > > the data structure representing a C++ Array object [1]. The
> >> > > > > > > > > disassembly and reassembly here is substantially simpler than
> >> > > > > > > > > the IPC protocol. A recursive structure in Flatbuffers would
> >> > > > > > > > > make RecordBatch messages much larger, so the flattened /
> >> > > > > > > > > disassembled representation we use for serialized record
> >> > > > > > > > > batches is the correct one
> >> > > > > > > >
> >> > > > > > > > I'm not sure I agree:
> >> > > > > > > >
> >> > > > > > > > - indeed, it's not a coincidence that the ArrowArray struct
> >> > > > > > > > looks quite closely like the C++ ArrayData object :-)  We have
> >> > > > > > > > good experience with that abstraction and it has proven to work
> >> > > > > > > > quite well
> >> > > > > > > >
> >> > > > > > > > - the IPC format is meant for serialization while the C data
> >> > > > > > > > protocol is meant for in-memory communication, so different
> >> > > > > > > > concerns apply
> >> > > > > > > >
> >> > > > > > > > - the fact that this makes the layout slightly larger doesn't
> >> > > > > > > > seem important at all; we're not talking about transferring data
> >> > > > > > > > over the wire
> >> > > > > > > >
> >> > > > > > > > There's also another argument for having a recursive struct: it
> >> > > > > > > > simplifies how the data type is represented, since we can encode
> >> > > > > > > > each child type individually instead of encoding it in the
> >> > > > > > > > parent's format string (same applies for metadata and individual
> >> > > > > > > > flags).
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > I was saying something different here. I was making an argument
> >> > > > > > > about why we use the flattened array-of-structs in the IPC
> >> > > > > > > protocol. One reason is that it's a more compact representation.
> >> > > > > > > That is not very important here because this protocol is only for
> >> > > > > > > *in-process* (for languages that have a C FFI facility) rather
> >> > > > > > > than *inter-process* communication.
> >> > > > > > >
> >> > > > > > > I agree also that the type encoding is simple, here, too, since we
> >> > > > > > > aren't having to split the schema and record batch between
> >> > > > > > > different serialized messages. There is some potential waste with
> >> > > > > > > having to populate the type fields multiple times when
> >> > > > > > > communicating a sequence of "chunks" from the same logical dataset.
> >> > > > > > >
> >> > > > > > > > > * The "formal" C protocol having the "assembled" shape means
> >> > > > > > > > > that many minimal Arrow users won't have to implement any
> >> > > > > > > > > separate data structures. They can just use the C struct
> >> > > > > > > > > directly or a slightly wrapped version thereof with some
> >> > > > > > > > > convenience functions.
> >> > > > > > > >
> >> > > > > > > > Yes, but the same applies to the current proposal.
> >> > > > > > > >
> >> > > > > > > > > * I think that requiring building a Flatbuffer for minimal use
> >> > > > > > > > > cases (e.g. communicating simple record batches with primitive
> >> > > > > > > > > types) passes on implementation burden to minimal users.
> >> > > > > > > >
> >> > > > > > > > It certainly does.
> >> > > > > > > >
> >> > > > > > > > > I think the mantra of the C protocol should be the following:
> >> > > > > > > > >
> >> > > > > > > > > * Users of the protocol have to write little to no code to use
> >> > > > > > > > > it. For example, populating an INT32 array should require only
> >> > > > > > > > > a few lines of code
> >> > > > > > > >
> >> > > > > > > > Agreed.  As a sidenote, the spec should have an example of doing
> >> > > > > > > > this in raw C.
> >> > > > > > > >
> >> > > > > > > > Regards
> >> > > > > > > >
> >> > > > > > > > Antoine.
> >> > > > > > >
> >> > > > >
> >> > > >
> >> >
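
The INT32 case raised in the quoted thread ("populating an INT32 array
should require only a few lines of code", plus the request for an example in
raw C) could look roughly like the sketch below. The struct `ExampleArray`
and the helpers `example_make_int32` and `example_release` are hypothetical
names made up for illustration. They are not the definitions from Antoine's
PR, and the field layout is only an assumption modeled on the "assembled"
shape discussed in the thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical struct, for illustration only (NOT the one from the PR). */
struct ExampleArray {
    int64_t length;
    int64_t null_count;
    int64_t n_buffers;
    /* Assumed convention: buffers[0] = validity bitmap (NULL when there are
       no nulls), buffers[1] = the int32 values. */
    const void **buffers;
    int64_t n_children;
    struct ExampleArray **children;
};

/* Fill `out` with a non-nullable INT32 array holding 0..n-1.
   Returns 0 on success, -1 if an allocation fails. */
static int example_make_int32(struct ExampleArray *out, int64_t n) {
    assert(out != NULL && n >= 0);
    int32_t *values = malloc((size_t)(n > 0 ? n : 1) * sizeof *values);
    const void **buffers = malloc(2 * sizeof *buffers);
    if (values == NULL || buffers == NULL) {
        free(values);
        free(buffers);
        return -1;
    }
    for (int64_t i = 0; i < n; ++i) {
        values[i] = (int32_t)i;
    }
    buffers[0] = NULL;    /* no validity bitmap: every slot is valid */
    buffers[1] = values;
    out->length = n;
    out->null_count = 0;
    out->n_buffers = 2;
    out->buffers = buffers;
    out->n_children = 0;  /* primitive type: no child arrays */
    out->children = NULL;
    return 0;
}

/* Free what example_make_int32 allocated and mark the struct as empty. */
static void example_release(struct ExampleArray *arr) {
    if (arr->buffers != NULL) {
        free((void *)arr->buffers[1]);
        free(arr->buffers);
    }
    arr->buffers = NULL;
    arr->n_buffers = 0;
    arr->length = 0;
}
```

A caller would declare a `struct ExampleArray` on the stack, pass its address
to `example_make_int32`, hand it to the consumer, and call `example_release`
once the consumer is done. How ownership and release are really negotiated is
the reference management question the thread defers.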

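
A rough sketch of the "A" and "B" pieces from the quoted discussion: a
recursive field structure, plus the logic that rebuilds it from a flattened,
depth-first (pre-order) list of field nodes, which is the order the IPC
record batch message uses. `FlatNode`, `TypeNode`, `Field`, `MAX_CHILDREN`
and `reassemble` are hypothetical names for illustration. Actual Flatbuffers
decoding and buffer handling are left out, so this only shows the shape of
the work a third party would otherwise have to write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_CHILDREN = 4 };  /* arbitrary cap to keep the sketch allocation-free */

/* Hypothetical stand-in for one entry of the flattened node list. */
struct FlatNode {
    int64_t length;
    int64_t null_count;
};

/* Hypothetical schema node: only the nesting structure matters here. */
struct TypeNode {
    int n_children;
    const struct TypeNode *const *children;
};

/* "A": a recursive structure representing one logical field. */
struct Field {
    int64_t length;
    int64_t null_count;
    int n_children;
    struct Field *children[MAX_CHILDREN];
};

/* "B": rebuild the tree rooted at `type` from nodes[pos..] into out[pos..].
   `out` must have room for n_nodes entries. Returns the index of the first
   unconsumed node, or -1 if the node list is too short for the schema. */
static int reassemble(const struct TypeNode *type,
                      const struct FlatNode *nodes, int n_nodes,
                      struct Field *out, int pos) {
    assert(type != NULL && nodes != NULL && out != NULL);
    if (pos < 0 || pos >= n_nodes || type->n_children > MAX_CHILDREN) {
        return -1;
    }
    struct Field *self = &out[pos];
    self->length = nodes[pos].length;
    self->null_count = nodes[pos].null_count;
    self->n_children = type->n_children;
    int next = pos + 1;  /* pre-order: children follow their parent */
    for (int i = 0; i < type->n_children; ++i) {
        int child_pos = next;
        next = reassemble(type->children[i], nodes, n_nodes, out, next);
        if (next < 0) {
            return -1;
        }
        self->children[i] = &out[child_pos];
    }
    return next;
}
```

For a struct type with two primitive children, the flattened list has three
nodes (parent, first child, second child), and `reassemble` started at
position 0 returns 3 with the parent's `children` pointing at the second and
third entries of `out`.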