From: Wes McKinney
Date: Wed, 2 Oct 2019 23:27:24 -0500
Subject: Re: [DISCUSS] C-level in-process array protocol
To: Micah Kornfield
Cc: dev

On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield wrote:
>
> I've tried to summarize my understanding of the debate so far and give some
> initial thoughts. I think there are two potentially different sets of users
> that we are targeting with a stable C API/ABI: ourselves and external
> parties.
>
> 1. Different language implementations within the Arrow project that want
> to call into each other's code. We still don't have a great story around
> this in terms of reusable libraries, and questions like [1] are a
> motivating example of making something better in this context.
> 2. Third parties wishing to support/integrate with Arrow. Some
> conjectures about these users:
>    - Users in this group are NOT necessarily familiar with existing
>      technologies Arrow uses (e.g. flatbuffers)
>    - The stability of the API is the primary concern (consumers don't want
>      to change when a new version of the library ships)
>    - An important secondary concern is additional libraries that need to be
>      integrated in addition to the API
>
> The main debate points seem to be:
>
> 1. Vector/Array oriented API vs existing Record Batch. Will an additional
> column oriented API become too much of a maintenance headache/cause
> fragmentation?
>
> - In my mind the question here is which set of users we are prioritizing.
> IMO the combination of flatbuffers and translation to/from RecordBatch
> format offers too much friction to make it easy for a third-party
> implementer to use.
> If we are prioritizing for our own internal use-cases I
> think we should try out a RecordBatch+Flatbuffers based C-API. We already
> have all the necessary building blocks.
>

If a C function passes you a string containing a RecordBatch
Flatbuffers message, what happens next? This message has to be
reassembled into a recursive data structure before you can "do"
anything with it. Are we expecting every third party project to
implement:

A. Data structures appropriate to represent a logical "field" in a
record batch (which have to be recursive to account for nested types'
children)
B. The logic to convert from the flattened Flatbuffers representation
to some implementation of A

I'm arguing that we should provide both to third parties. To build B,
you need A. Some consumers will only use A. This discussion is
essentially about developing an ultraminimalist "drop-in" C
implementation of A.

> 2. How onerous is the dependency on flatbuffers, both from a learning
> curve perspective and as a dependency for third-party integrators?
> - Flatbuffers aren't entirely straightforward and I think if we do move
> forward with an API based on Column/Array we should consider alternatives
> as long as the necessary parsing code can be done in a small amount of code
> (I'm personally against JSON for this, but can see the arguments for it).
>
> 3. Do all existing library implementations need to support both
> Column/Array a ABI? How will compliance be checked for the new API/ABI?
>
> - I'm still thinking this through.
>
> [1]
> https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
>
> On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau wrote:
>
> > I'd like to hear more opinions from others on this topic. This conversation
> > seems mostly dominated by comments from myself, Wes and Antoine.
> >
> > I think it is reasonable to argue that keeping any ABI (or header/struct
> > pattern) as narrow as possible would allow us to minimize overlap with the
> > existing in-memory specification. In Arrow's case, this could be as simple
> > as a single memory pointer for schema (backed by flatbuffers) and a single
> > memory location for data (that references the record batch header, which in
> > turn provides pointers into the actual arrow data). Extensions would need
> > to be added for reference management as done here but I continue to think
> > we should defer discussion of that until the base data structures are
> > resolved. I see the comments here as arguing for a much broader ABI, in
> > part to support having people build "Arrow" components that interconnect
> > using this new interface. I understand the desire to expand the ABI to be
> > driven by needs to reduce dependencies and ease usability.
> >
> > The representation within the related patch is being presented as a way for
> > applications to share Arrow data but is not easily accessible to all
> > languages. I want to avoid a situation where someone says "I produced an
> > Arrow API" when what they've really done is created a C interface which
> > only a small subset of languages can actually leverage. For example, every
> > language now knows how to parse the existing schema definition as rendered
> > in flatbuf. In order to interact with something that implements this new
> > pattern one would also be required to implement completely new schema
> > consumption code. In the proposal itself it suggests this (for example
> > enhancing the C++ library to consume structures produced this way).
> >
> > As I said, I really want to hear more opinions. Running this past various
> > developers I know, many have echoed my concerns but that really doesn't
> > matter (and who knows how much of that is colored by my presentation of the
> > issue). What do people here think?
> > If someone builds an "Arrow" library
> > that implements this set of structures, how does one use it in Node? In
> > Java? Does it drive creation of a secondary set of interfaces in each of
> > those languages to work with this kind of pattern? (For example, in a JVM
> > view of the world, working with a plain struct in Java rather than a set of
> > memory pointers against our existing IPC formats would be quite painful and
> > we'd definitely need to create some glue code for users. I worry the same
> > pattern would occur in many other languages.)
> >
> > To respond directly to some of Wes's most recent comments from the email
> > below: I struggle to map your description of the situation to the rest of
> > the thread and the proposed patch. For example, you say that a non-goal is
> > "creating a new canonical way to serialize metadata" but the patch
> > proposes a concrete string based encoding system to describe data types.
> > Aren't those things in conflict?
> >
> > I'll also think more on this and challenge my own perspective. This isn't
> > where my focus is so my comments aren't as developed/thoughtful as I'd
> > like.
> >
> >
> > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney wrote:
> >
> > > hi Jacques,
> > >
> > > I think we've veered off course a bit and maybe we could reframe the
> > > discussion.
> > >
> > > Goals
> > > * A "drop-in" header-only C file that projects can use as a
> > > programming interface either internally only or to expose in-memory
> > > data structures between C functions at call sites. Ideally little to
> > > no disassembly/reassembly should be required on either "side" of the
> > > call site.
> > > * Simplifying adoption of Arrow for C programmers, or languages based
> > > around C FFI
> > >
> > > Non-goals
> > > * Expanding the columnar format or creating an alternative canonical
> > > in-memory representation
> > > * Creating a new canonical way to serialize metadata
> > >
> > > Note that this use case has been on my mind for more than 2 years:
> > > https://issues.apache.org/jira/browse/ARROW-1058
> > >
> > > I think there are a couple of potentially misleading things at play here:
> > >
> > > 1. The use of the word "protocol". In C, a struct has a well-defined
> > > binary layout, so a C API is also an ABI. Using C structs to
> > > communicate data can be considered to be a protocol, but it means
> > > something different in the context of the "Arrow protocol". I think we
> > > need to call this a "C API".
> > >
> > > 2. The documentation for this in Antoine's PR is in the format/
> > > directory. It would probably be better to have a "C API" section in
> > > the documentation.
> > >
> > > The header file under discussion and the documentation about it are
> > > best considered as a "library".
> > >
> > > It might be useful at some point to create a C99 implementation of the
> > > IPC protocol as well using FlatCC with the goal of having a complete
> > > implementation of the columnar format in C with minimal binary
> > > footprint. This is analogous to the NanoPB project, which is an
> > > implementation of Protocol Buffers with small code size:
> > >
> > > https://github.com/nanopb/nanopb
> > >
> > > Let me know if this makes more sense.
> > >
> > > I think it's important to communicate clearly about this primarily for
> > > the benefit of the outside world which can confuse easily as we have
> > > observed over the last few years =)
> > >
> > > Wes
> > >
> > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau wrote:
> > > >
> > > > I disagree with this statement:
> > > >
> > > > - the IPC format is meant for serialization while the C data protocol is
> > > > meant for in-memory communication, so different concerns apply
> > > >
> > > > If that is how a particular implementation presents it, that is a
> > > > weakness of the implementation, not the format. The primary use case I
> > > > was focused on when working on the initial format was communication within
> > > > the same process. It seems like this is being used as a basis for the
> > > > introduction of new things when the premise is inconsistent with the
> > > > intention of the creation. The specific reason we used flatbuffers in the
> > > > project was to collapse the separation of in-process and out-of-process
> > > > communication. It means the same thing it does with the Arrow data itself:
> > > > that a consumer doesn't have to use a particular library to interact with
> > > > and use the data.
> > > >
> > > > It seems like there are two ideas here:
> > > >
> > > > 1) How do we make it easier for people to use Arrow?
> > > > 2) Should we implement a new in-memory representation of Arrow that is
> > > > language specific?
> > > >
> > > > I'm entirely in support of number one. If for a particular type of domain,
> > > > people want an easier way to interact with Arrow, let's make a new library
> > > > that helps with that. In each of our current libraries, we do many things
> > > > to make it easier to work with Arrow. None of those require a change to the
> > > > core format or are formalized as a new in-memory standard.
> > > > The in-memory
> > > > representations of Rust or JavaScript or Java objects are implementation
> > > > details.
> > > >
> > > > I'm against number two as it creates a fragmentation problem. Arrow is
> > > > about having a single canonical format for memory for both metadata and
> > > > data. Having multiple in-memory formats (especially when some are not
> > > > language independent) is counter to the goals of the project.
> > >
> > > I don't think anyone is proposing anything that would cause fragmentation.
> > >
> > > A central question is whether it is useful to define a reusable C ABI
> > > for the Arrow columnar format, and if there is sufficient interest, a
> > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> > > message) that assembles and disassembles the data structures defined
> > > in the C ABI.
> > >
> > > We could separately create a tiny implementation of the Arrow IPC
> > > protocol using FlatCC that could be dropped into applications
> > > requiring only a C compiler and nothing else.
> > >
> > > >
> > > > Two other, separate comments:
> > > > 1) I don't understand the idea that we need to change the way Arrow
> > > > fundamentally works so that people can avoid using a dependency. If the
> > > > dependency is small, open source and easy to build, people can fork it and
> > > > include it directly if they want to. Let's not violate project principles
> > > > because DuckDB has a religious perspective on dependencies. If the problem
> > > > is people have to swallow too large of a pill to do basic things with Arrow
> > > > in C, let's focus on fixing that (to our definition of ease, not someone
> > > > else's). If FlatCC solves some of those things, great. If we need to build a
> > > > baby integration library that is more C centric, great. Neither of those
> > > > things requires implementing something at the format level.
> > > >
> > > > 2) It seems like we should discuss the data structure problem separately
> > > > from the reference management concern.
> > > >
> > > >
> > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney wrote:
> > > >
> > > > > hi Antoine,
> > > > >
> > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou wrote:
> > > > > >
> > > > > >
> > > > > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> > > > > > > A couple things:
> > > > > > >
> > > > > > > * I think it would be better for a C protocol / FFI for Arrow
> > > > > > > arrays/vectors to have the same "shape" as an assembled array.
> > > > > > > Note that the C structs here have very nearly the same "shape"
> > > > > > > as the data structure representing a C++ Array object [1]. The
> > > > > > > disassembly and reassembly here is substantially simpler than
> > > > > > > the IPC protocol. A recursive structure in Flatbuffers would
> > > > > > > make RecordBatch messages much larger, so the flattened /
> > > > > > > disassembled representation we use for serialized record
> > > > > > > batches is the correct one
> > > > > >
> > > > > > I'm not sure I agree:
> > > > > >
> > > > > > - indeed, it's not a coincidence that the ArrowArray struct looks
> > > > > > quite like the C++ ArrayData object :-) We have good experience with
> > > > > > that abstraction and it has proven to work quite well
> > > > > >
> > > > > > - the IPC format is meant for serialization while the C data protocol
> > > > > > is meant for in-memory communication, so different concerns apply
> > > > > >
> > > > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > > > important at all; we're not talking about transferring data over the
> > > > > > wire
> > > > > >
> > > > > > There's also another argument for having a recursive struct: it
> > > > > > simplifies how the data type is represented, since we can encode each
> > > > > > child type
> > > > > > individually instead of encoding it in the parent's format
> > > > > > string (same applies for metadata and individual flags).
> > > > > >
> > > > >
> > > > > I was saying something different here. I was making an argument about
> > > > > why we use the flattened array-of-structs in the IPC protocol. One
> > > > > reason is that it's a more compact representation. That is not very
> > > > > important here because this protocol is only for *in-process* (for
> > > > > languages that have a C FFI facility) rather than *inter-process*
> > > > > communication.
> > > > >
> > > > > I agree also that the type encoding is simple here, too, since we
> > > > > aren't having to split the schema and record batch between different
> > > > > serialized messages. There is some potential waste with having to
> > > > > populate the type fields multiple times when communicating a sequence
> > > > > of "chunks" from the same logical dataset.
> > > > >
> > > > > > > * The "formal" C protocol having the "assembled" shape means that
> > > > > > > many minimal Arrow users won't have to implement any separate data
> > > > > > > structures. They can just use the C struct directly or a slightly
> > > > > > > wrapped version thereof with some convenience functions.
> > > > > >
> > > > > > Yes, but the same applies to the current proposal.
> > > > > >
> > > > > > > * I think that requiring building a Flatbuffer for minimal use cases
> > > > > > > (e.g. communicating simple record batches with primitive types)
> > > > > > > passes on implementation burden to minimal users.
> > > > > >
> > > > > > It certainly does.
> > > > > >
> > > > > > > I think the mantra of the C protocol should be the following:
> > > > > > >
> > > > > > > * Users of the protocol have to write little to no code to use it.
> > > > > > > For
> > > > > > > example, populating an INT32 array should require only a few lines
> > > > > > > of code
> > > > > >
> > > > > > Agreed. As a sidenote, the spec should have an example of doing this
> > > > > > in raw C.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > >
> > > >
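
To make the "few lines of code" point concrete, here is a minimal sketch
in raw C of populating an INT32 array in a recursive, "assembled" struct.
It is an illustration only and is not taken from this thread or from the
struct definitions in the PR: the ExampleArray layout, its "i32" format
string, and the example_* helper names are all assumptions made up for
this sketch. The only Arrow facts it relies on are that a primitive array
has a validity bitmap buffer and a values buffer, that the bitmap may be
omitted when there are no nulls, and that nested types carry child arrays.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* HYPOTHETICAL layout for illustration; not the struct from the PR. */
struct ExampleArray {
    const char *format;             /* hypothetical type tag, e.g. "i32" */
    int64_t length;                 /* number of logical values */
    int64_t null_count;             /* number of nulls among them */
    int64_t n_buffers;              /* number of entries in `buffers` */
    const void **buffers;           /* [0] validity bitmap (may be NULL), [1] values */
    int64_t n_children;             /* number of entries in `children` */
    struct ExampleArray **children; /* recursive part: one entry per nested child */
};

/* Build an INT32 array without nulls by copying `values`.
 * Returns NULL on allocation failure; release with example_free(). */
struct ExampleArray *example_make_int32(const int32_t *values, int64_t length) {
    assert(values != NULL && length > 0);

    struct ExampleArray *arr = malloc(sizeof *arr);
    int32_t *data = malloc((size_t)length * sizeof *data);
    const void **buffers = malloc(2 * sizeof *buffers);
    if (arr == NULL || data == NULL || buffers == NULL) {
        free(arr);
        free(data);
        free((void *)buffers);
        return NULL;
    }
    memcpy(data, values, (size_t)length * sizeof *data);

    buffers[0] = NULL; /* no validity bitmap: every value is valid */
    buffers[1] = data;

    arr->format = "i32";
    arr->length = length;
    arr->null_count = 0;
    arr->n_buffers = 2;
    arr->buffers = buffers;
    arr->n_children = 0; /* INT32 is not a nested type */
    arr->children = NULL;
    return arr;
}

/* Returns 1 if slot i is valid (non-null); bits are numbered LSB first. */
int example_is_valid(const struct ExampleArray *arr, int64_t i) {
    const uint8_t *bitmap = arr->buffers[0];
    if (bitmap == NULL) return 1;
    return (bitmap[i / 8] >> (i % 8)) & 1;
}

/* A consumer reads the struct directly; no reassembly step is needed. */
int64_t example_sum_int32(const struct ExampleArray *arr) {
    const int32_t *data = arr->buffers[1];
    int64_t total = 0;
    for (int64_t i = 0; i < arr->length; i++) {
        if (example_is_valid(arr, i)) total += data[i];
    }
    return total;
}

/* Frees what example_make_int32() allocated. */
void example_free(struct ExampleArray *arr) {
    if (arr == NULL) return;
    free((void *)arr->buffers[1]);
    free((void *)arr->buffers);
    free(arr);
}
```

A producer would call example_make_int32(values, n), hand the pointer to
any C function that understands the struct, and call example_free() when
done. Ownership is handled with a plain free function only to keep the
sketch short; reference management is treated in the thread as a separate
question.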