MIME-Version: 1.0
References: <65deced8-d885-b48c-7c87-6f82d4eefc68@python.org>
 <20190919214936.694d4f87@fsol> <CAJPUwMAV9kYaYzcW5=1EDW_u33+8RwJOLpPwjhXY3gwwDKkiDg@mail.gmail.com>
 <CAKa9qDn3EPmdS8VAMykeJSEb=SMhbza0WPSmBR=QWs8Uc8ao3g@mail.gmail.com>
 <CAKa9qDm0TyT=SPq3H3+wc-gkGzKzfVLPHEJw+U5FqzV3d0d1dA@mail.gmail.com>
 <06fb814c-d9ee-7cbd-bcdf-c1729489cdf2@python.org> <CAJPUwMAUchC8LZ_iekeAu2TsLcdyw0nR8C0wJ=R8i_eHA2tsBA@mail.gmail.com>
 <587714f5-dea2-48a4-9fbf-2272d9867c04@python.org> <CALPsrS0d4JaDJyE4ySUmimYp0hrvEOGysaLxVb7vkKfFHK0w=g@mail.gmail.com>
 <9396e1d3-4ec2-ed7a-6594-57e23ff80586@python.org> <CAJPUwMCa9+kw3ReBUg-DHAFVffbxTdt15Ss+5n5vZSxBxz9sXA@mail.gmail.com>
 <8ba28a18-758a-8f27-2acd-109dc8b26274@python.org> <CAJPUwMAzchkCm-bkcXhKRzdu0D9jv+iJjJfm7uN2REY=LMZoYA@mail.gmail.com>
 <CAKa9qDm=idEtzL1-0XXUTcvJ-JEzCsdNisaooayLbZMnWsHs9Q@mail.gmail.com>
In-Reply-To: <CAKa9qDm=idEtzL1-0XXUTcvJ-JEzCsdNisaooayLbZMnWsHs9Q@mail.gmail.com>
From: Wes McKinney <wesmckinn@gmail.com>
Date: Tue, 1 Oct 2019 21:32:26 -0500
Message-ID: <CAJPUwMCqc-4KXe_d2goP6q29LLWUvV_Qivcp3Eu3w+5WmeVRLQ@mail.gmail.com>
Subject: Re: [DISCUSS] C-level in-process array protocol
To: dev <dev@arrow.apache.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

hi Jacques,

I think we've veered off course a bit and maybe we could reframe the discussion.

Goals
* A "drop-in" header-only C file that projects can use as a
programming interface either internally only or to expose in-memory
data structures between C functions at call sites. Ideally little to
no disassembly/reassembly should be required on either "side" of the
call site.
* Simplifying adoption of Arrow for C programmers, or languages based
around C FFI
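
As a rough illustration of the first goal, here is a minimal sketch of
what such a header might contain. The struct name SketchArray, its
fields, the "i32" type code and the two helper functions are all
hypothetical; this is not the struct from Antoine's PR, only an
assumption about the general shape.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct for illustration; not the layout proposed in the PR. */
struct SketchArray {
  const char* format;             /* hypothetical type code, e.g. "i32" */
  int64_t length;                 /* number of logical values */
  int64_t null_count;             /* number of null values */
  int64_t n_buffers;              /* number of entries in buffers */
  int64_t n_children;             /* number of entries in children */
  const void** buffers;           /* [0] validity bitmap (NULL if no nulls), [1] values */
  struct SketchArray** children;  /* child arrays for nested types */
};

/* Producer side: wrap caller-owned int32 values as a non-null array.
   Nothing is copied; buffer_slots must have room for two pointers and,
   like values, must outlive the struct. */
static inline void sketch_wrap_int32(struct SketchArray* out,
                                     const void** buffer_slots,
                                     const int32_t* values, int64_t length) {
  assert(out != NULL && buffer_slots != NULL && length >= 0);
  buffer_slots[0] = NULL;   /* no validity bitmap: every value is valid */
  buffer_slots[1] = values;
  out->format = "i32";
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = buffer_slots;
  out->children = NULL;
}

/* Consumer side: read the values straight out of the struct. */
static inline int64_t sketch_sum_int32(const struct SketchArray* arr) {
  const int32_t* values = (const int32_t*)arr->buffers[1];
  int64_t total = 0;
  int64_t i;
  for (i = 0; i < arr->length; ++i) total += values[i];
  return total;
}
```

The point of the sketch is that neither side of the call site has to
disassemble or reassemble anything: the producer fills in a few fields
and the consumer reads the same struct directly.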

Non-goals
* Expanding the columnar format or creating an alternative canonical
in-memory representation
* Creating a new canonical way to serialize metadata

Note that this use case has been on my mind for more than 2 years:
https://issues.apache.org/jira/browse/ARROW-1058

I think there are a couple of potentially misleading things at play here:

1. The use of the word "protocol". In C, a struct has a well-defined
binary layout, so a C API is also an ABI. Using C structs to
communicate data can be considered to be a protocol, but it means
something different in the context of the "Arrow protocol". I think we
need to call this a "C API".

2. The documentation for this in Antoine's PR is in the format/
directory. It would probably be better to have a "C API" section in
the documentation.
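
To make the API-is-also-an-ABI remark in point 1 concrete, here is a
small sketch. SketchHeader and both functions are hypothetical names
invented for this example. It assumes a mainstream platform ABI with
8-bit bytes, where consecutive int64_t members are laid out without
padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct, used only to illustrate a fixed binary layout. */
struct SketchHeader {
  int64_t length;
  int64_t null_count;
  int64_t offset;
};

/* Returns 1 if the compiler laid the struct out as the sketch assumes. */
static int sketch_layout_ok(void) {
  return offsetof(struct SketchHeader, length) == 0 &&
         offsetof(struct SketchHeader, null_count) == 8 &&
         offsetof(struct SketchHeader, offset) == 16 &&
         sizeof(struct SketchHeader) == 24;
}

/* Read a field the way an FFI consumer without the C declaration would:
   by byte offset from the start of the struct. */
static int64_t sketch_read_i64_at(const void* base, size_t byte_offset) {
  int64_t value;
  assert(base != NULL);
  memcpy(&value, (const unsigned char*)base + byte_offset, sizeof value);
  return value;
}
```

A caller in another language that knows only the offsets 0, 8 and 16
can read the three fields without including the header, which is what
makes the struct definition an ABI rather than just a source-level API.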

The header file under discussion and the documentation about it are
best considered as a "library".

It might be useful at some point to create a C99 implementation of the
IPC protocol as well using FlatCC with the goal of having a complete
implementation of the columnar format in C with minimal binary
footprint. This is analogous to the NanoPB project, which is an
implementation of Protocol Buffers with small code size:

https://github.com/nanopb/nanopb

Let me know if this makes more sense.

I think it's important to communicate clearly about this, primarily for
the benefit of the outside world, which can easily get confused, as we
have observed over the last few years =)

Wes

On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <jacques@apache.org> wrote:
>
> I disagree with this statement:
>
> - the IPC format is meant for serialization while the C data protocol is
> meant for in-memory communication, so different concerns apply
>
> If that is how a particular implementation presents it, that is a
> weakness of the implementation, not the format. The primary use case I
> was focused on when working on the initial format was communication within
> the same process. It seems like this is being used as a basis for the
> introduction of new things when the premise is inconsistent with the
> intention of the creation. The specific reason we used flatbuffers in the
> project was to collapse the separation of in-process and out-of-process
> communication. It means the same thing it does with the Arrow data itself:
> that a consumer doesn't have to use a particular library to interact with
> and use the data.
>
> It seems like there are two ideas here:
>
> 1) How do we make it easier for people to use Arrow?
> 2) Should we implement a new in-memory representation of Arrow that is
> language-specific?
>
> I'm entirely in support of number one. If for a particular type of domain,
> people want an easier way to interact with Arrow, let's make a new library
> that helps with that. In each of our current libraries, we do many things
> to make it easier to work with Arrow. None of those require a change to the
> core format or are formalized as a new in-memory standard. The in-memory
> representations of Rust or JavaScript or Java objects are implementation
> details.
>
> I'm against number two as it creates a fragmentation problem. Arrow is
> about having a single canonical format for memory for both metadata and
> data. Having multiple in-memory formats (especially when some are not
> language independent) is counter to the goals of the project.

I don't think anyone is proposing anything that would cause fragmentation.

A central question is whether it is useful to define a reusable C ABI
for the Arrow columnar format, and if there is sufficient interest, a
tiny C implementation of the IPC protocol (which uses the Flatbuffers
message) that assembles and disassembles the data structures defined
in the C ABI.

We could separately create a tiny implementation of the Arrow IPC
protocol using FlatCC that could be dropped into applications
requiring only a C compiler and nothing else.
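
To illustrate what "disassembles" could mean here, below is a sketch
that walks a recursive array struct depth-first and emits a flat list
of (length, null_count) nodes, similar in spirit to the flattened field
nodes of an IPC record batch. The names SketchNested, SketchFlatNode
and sketch_flatten are hypothetical, and the sketch leaves out buffers
and the Flatbuffers encoding entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical recursive ("assembled") array shape. */
struct SketchNested {
  int64_t length;
  int64_t null_count;
  int64_t n_children;
  struct SketchNested** children;
};

/* Hypothetical flattened ("disassembled") per-field node. */
struct SketchFlatNode {
  int64_t length;
  int64_t null_count;
};

/* Pre-order depth-first flattening: parent first, then each child in order.
   Returns the number of nodes written, or -1 if out has too little room. */
static int64_t sketch_flatten(const struct SketchNested* arr,
                              struct SketchFlatNode* out, int64_t capacity) {
  int64_t written;
  int64_t i;
  assert(arr != NULL);
  if (capacity < 1) return -1;
  out[0].length = arr->length;
  out[0].null_count = arr->null_count;
  written = 1;
  for (i = 0; i < arr->n_children; ++i) {
    int64_t n = sketch_flatten(arr->children[i], out + written,
                               capacity - written);
    if (n < 0) return -1;
    written += n;
  }
  return written;
}
```

Reassembly would be the inverse walk: consume nodes in the same order
while rebuilding the parent/child pointers from the schema.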


>
> Two other, separate comments:
> 1) I don't understand the idea that we need to change the way Arrow
> fundamentally works so that people can avoid using a dependency. If the
> dependency is small, open source and easy to build, people can fork it and
> include it directly if they want to. Let's not violate project principles
> because DuckDB has a religious perspective on dependencies. If the problem
> is people have to swallow too large of a pill to do basic things with Arrow
> in C, let's focus on fixing that (to our definition of ease, not someone
> else's). If FlatCC solves some of those things, great. If we need to build a
> baby integration library that is more C centric, great. Neither of those
> things requires implementing something at the format level.
>
> 2) It seems like we should discuss the data structure problem separately
> from the reference management concern.
>
>
> On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>
> > hi Antoine,
> >
> > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <antoine@python.org> wrote:
> > >
> > >
> > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> > > > A couple things:
> > > >
> > > > * I think a C protocol / FFI for Arrow array/vectors would be better
> > > > to have the same "shape" as an assembled array. Note that the C
> > > > structs here have very nearly the same "shape" as the data structure
> > > > representing a C++ Array object [1]. The disassembly and reassembly
> > > > here is substantially simpler than the IPC protocol. A recursive
> > > > structure in Flatbuffers would make RecordBatch messages much larger,
> > > > so the flattened / disassembled representation we use for serialized
> > > > record batches is the correct one
> > >
> > > I'm not sure I agree:
> > >
> > > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> > > closely like the C++ ArrayData object :-)  We have good experience with
> > > that abstraction and it has proven to work quite well
> > >
> > > - the IPC format is meant for serialization while the C data protocol is
> > > meant for in-memory communication, so different concerns apply
> > >
> > > - the fact that this makes the layout slightly larger doesn't seem
> > > important at all; we're not talking about transferring data over the wire
> > >
> > > There's also another argument for having a recursive struct: it
> > > simplifies how the data type is represented, since we can encode each
> > > child type individually instead of encoding it in the parent's format
> > > string (same applies for metadata and individual flags).
> > >
> >
> > I was saying something different here. I was making an argument about
> > why we use the flattened array-of-structs in the IPC protocol. One
> > reason is that it's a more compact representation. That is not very
> > important here because this protocol is only for *in-process* (for
> > languages that have a C FFI facility) rather than *inter-process*
> > communication.
> >
> > I agree also that the type encoding is simple, here, too, since we
> > aren't having to split the schema and record batch between different
> > serialized messages. There is some potential waste with having to
> > populate the type fields multiple times when communicating a sequence
> > of "chunks" from the same logical dataset.
> >
> > > > * The "formal" C protocol having the "assembled" shape means that many
> > > > minimal Arrow users won't have to implement any separate data
> > > > structures. They can just use the C struct directly or a slightly
> > > > wrapped version thereof with some convenience functions.
> > >
> > > Yes, but the same applies to the current proposal.
> > >
> > > > * I think that requiring building a Flatbuffer for minimal use cases
> > > > (e.g. communicating simple record batches with primitive types) passes
> > > > on implementation burden to minimal users.
> > >
> > > It certainly does.
> > >
> > > > I think the mantra of the C protocol should be the following:
> > > >
> > > > * Users of the protocol have to write little to no code to use it. For
> > > > example, populating an INT32 array should require only a few lines of
> > > > code
> > >
> > > Agreed.  As a sidenote, the spec should have an example of doing this in
> > > raw C.
> > >
> > > Regards
> > >
> > > Antoine.
> >