From: Wes McKinney
Date: Tue, 1 Oct 2019 07:42:07 -0500
Subject: Re: [DISCUSS] C-level in-process array protocol
To: dev

hi Antoine,

On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou wrote:
>
> On 01/10/2019 at 00:39, Wes McKinney wrote:
> > A couple things:
> >
> > * I think a C protocol / FFI for Arrow array/vectors would be better
> > to have the same "shape" as an assembled array. Note that the C
> > structs here have very nearly the same "shape" as the data structure
> > representing a C++ Array object [1]. The disassembly and reassembly
> > here is substantially simpler than the IPC protocol. A recursive
> > structure in Flatbuffers would make RecordBatch messages much larger,
> > so the flattened / disassembled representation we use for serialized
> > record batches is the correct one
>
> I'm not sure I agree:
>
> - indeed, it's not a coincidence that the ArrowArray struct looks quite
> closely like the C++ ArrayData object :-) We have good experience with
> that abstraction and it has proven to work quite well
>
> - the IPC format is meant for serialization while the C data protocol is
> meant for in-memory communication, so different concerns apply
>
> - the fact that this makes the layout slightly larger doesn't seem
> important at all; we're not talking about transferring data over the wire
>
> There's also another argument for having a recursive struct: it
> simplifies how the data type is represented, since we can encode each
> child type individually instead of encoding it in the parent's format
> string (same applies for metadata and individual flags).
>

I was saying something different here.
I was making an argument about why we use the flattened array-of-structs
in the IPC protocol. One reason is that it's a more compact
representation. That is not very important here because this protocol
is only for *in-process* (for languages that have a C FFI facility)
rather than *inter-process* communication.

I also agree that the type encoding is simple here, too, since we
aren't having to split the schema and record batch between different
serialized messages. There is some potential waste in having to
populate the type fields multiple times when communicating a sequence
of "chunks" from the same logical dataset.

> > * The "formal" C protocol having the "assembled" shape means that many
> > minimal Arrow users won't have to implement any separate data
> > structures. They can just use the C struct directly or a slightly
> > wrapped version thereof with some convenience functions.
>
> Yes, but the same applies to the current proposal.
>
> > * I think that requiring building a Flatbuffer for minimal use cases
> > (e.g. communicating simple record batches with primitive types) passes
> > on implementation burden to minimal users.
>
> It certainly does.
>
> > I think the mantra of the C protocol should be the following:
> >
> > * Users of the protocol have to write little to no code to use it. For
> > example, populating an INT32 array should require only a few lines of
> > code
>
> Agreed. As a sidenote, the spec should have an example of doing this in
> raw C.
>
> Regards
>
> Antoine.
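
To make the "few lines of raw C" and "each child encodes its own type"
points concrete, here is a rough sketch of what a recursive, assembled
struct could look like. Everything in it is an assumption made for
illustration: the struct SketchArray, its field names, the format
strings "i32" and "struct", and the three helper functions are
hypothetical and are not taken from the proposal under discussion.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* HYPOTHETICAL layout for illustration only; not the proposal's struct. */
struct SketchArray {
    const char *format;            /* type of THIS node only (made-up encoding) */
    int64_t length;                /* number of logical values */
    int64_t null_count;            /* number of null values */
    int64_t n_buffers;             /* entries in `buffers` */
    const void **buffers;          /* here: [validity bitmap or NULL, data] */
    int64_t n_children;            /* entries in `children` */
    struct SketchArray **children; /* each child carries its own format */
};

/* Build an INT32 array with no nulls by copying `n` values.
 * Returns NULL if an allocation fails. */
struct SketchArray *sketch_make_int32(const int32_t *values, int64_t n) {
    assert(n >= 0 && (n == 0 || values != NULL));
    struct SketchArray *arr = calloc(1, sizeof *arr);
    int32_t *data = malloc((n > 0 ? (size_t)n : 1) * sizeof *data);
    const void **buffers = calloc(2, sizeof *buffers);
    if (!arr || !data || !buffers) {
        free(arr); free(data); free(buffers);
        return NULL;
    }
    if (n > 0) memcpy(data, values, (size_t)n * sizeof *data);
    buffers[0] = NULL;  /* no validity bitmap: every value is valid */
    buffers[1] = data;  /* contiguous int32 values */
    arr->format = "i32";
    arr->length = n;
    arr->n_buffers = 2;
    arr->buffers = buffers;
    return arr;
}

/* Wrap equal-length children in a struct-like parent and take ownership
 * of them. The parent's format names only the parent; child types live
 * in the children. The parent's own validity bitmap is omitted here. */
struct SketchArray *sketch_make_struct(struct SketchArray *const *children,
                                       int64_t n_children) {
    assert(n_children > 0 && children != NULL);
    struct SketchArray *arr = calloc(1, sizeof *arr);
    struct SketchArray **kids = malloc((size_t)n_children * sizeof *kids);
    if (!arr || !kids) {
        free(arr); free(kids);
        return NULL;
    }
    memcpy(kids, children, (size_t)n_children * sizeof *kids);
    arr->format = "struct";
    arr->length = children[0]->length;
    arr->n_children = n_children;
    arr->children = kids;
    return arr;
}

/* Recursively free a node, its buffers, and all of its children. */
void sketch_free(struct SketchArray *arr) {
    if (!arr) return;
    for (int64_t i = 0; i < arr->n_children; i++) sketch_free(arr->children[i]);
    for (int64_t i = 0; i < arr->n_buffers; i++) free((void *)arr->buffers[i]);
    free(arr->children);
    free((void *)arr->buffers);
    free(arr);
}
```

With helpers like these, a producer would call sketch_make_int32(values, n)
and hand the pointer to a consumer, which reads the values through
buffers[1]. Nesting is just a matter of passing such nodes to
sketch_make_struct, so no parent format string has to describe its
children.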