From: Wes McKinney
Date: Tue, 16 Aug 2016 04:03:10 -0700
Subject: Re: Discussion: Should we make string/binary types first class Arrow Array
 types?
To: Julien Le Dem
Cc: Micah Kornfield, dev@arrow.apache.org, Jacques Nadeau

See ARROW-262

On Mon, Aug 15, 2016 at 3:38 PM, Wes McKinney wrote:
> These IPC details we should definitely document outside of the code.
>
> For the String/Binary type question, I want to start a document that
> explains the logical data types in Message.fbs in terms of
>
> - what Arrow memory layout they use (for example: Int32 uses fixed bit
> width 32 bits), and String uses List<UInt8> (with the restriction that
> the inner buffer must not have any nulls, and the validity bitmap is
> omitted)
>
> - any type-specific custom logic around deconstructing or
> reconstructing an Arrow container in an IPC/RPC setting. What we have
> been debating in this e-mail thread is altering the appearance of
> String/Binary representation in a record batch
> (https://github.com/apache/arrow/blob/master/format/Message.fbs#L147)
> from being identical to List<UInt8> (4 buffers in the flattened buffer
> list -- 2 for the list node, 2 for the UInt8 node) to its own
> "collapsed" form (3 buffers: bitmap, offsets, data). This means that
> any code that is sending/receiving a record batch will need separate
> code paths to handle List and String respectively (in the C++ code, we
> are currently using the same code path for both)
>
> For example, changes like ARROW-253
> (https://github.com/apache/arrow/commit/dc01f099d966b92f4de7679b4a1caf97c363e08e)
> would be documented outside of the code and message IDL.
>
> I will open one or more JIRAs and write a patch to try to close the
> loop on this.
>
> - Wes
>
> On Tue, Aug 16, 2016 at 6:04 AM, Julien Le Dem wrote:
> > There's ARROW-258 which is about clarifying difference (if any) in metadata
> > across RPC (sockets), IPC (shared memory) and files.
> > The vector layout is the same except in RPC or files they get concatenated
> > together when copied over.
> > The metadata should be mostly the same (ideally the same). Buffer offsets
> > are relative to the beginning of the body in the context of RPC and file
> > start in files. In the context of IPC it looks like we need an extra page id
> > (from Message.fbs). Is this correct?
> >
> > On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield wrote:
> >>
> >> Thanks Wes,
> >> This makes sense. +1 on the "Logical Types / IPC layout
> >> document" is there a JIRA open for this?
> >>
> >> I'll open a JIRA item to change the inheritance of string/binary in the
> >> C++ code base.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney
> >> wrote:
> >>>
> >>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <emkornfield@gmail.com>
> >>> wrote:
> >>> > Sorry for the late reply.
> >>> >
> >>> > This all sounds reasonable to me. But I'm not sure I understand
> >>> > exactly
> >>> > what you mean by
> >>> >
> >>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> >>> >> would be a single array unit in the buffer stream and flattened Field
> >>> >> metadata rather than nested types (2 array units as they are
> >>> >> presently).
> >>> >
> >>> >
> >>> > The way I read it this seems to me to contradict the
> >>> > cross-implementation as
> >>> > "List"?
> >>> >
> >>> > Thanks,
> >>> > Micah
> >>> >
> >>>
> >>> I think we can resolve this by starting a "Logical Types and IPC/RPC
> >>> layout" specification document.
> >>>
> >>> The schema metadata
> >>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
> >>> as I understand it, strictly the domain of logical types. I believe
> >>> there is some minor conflation of the notions of primitive physical
> >>> types and primitive logical types.
> >>>
> >>> While String / Binary have identical physical layouts to
> >>> List<UInt8 not null>, in the domain of logical types and IPC, what we
> >>> are saying is that these types are:
> >>>
> >>> - logically speaking: primitive, non-nested types
> >>> - their IPC layout is the flattened version of the nested List<UInt8>
> >>> counterpart -- a single Field node having String type (with a null
> >>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
> >>> data. Structurally on the wire / in shared memory (compared with
> >>> List<UInt8>) the only difference is the Field metadata (since
> >>> if null count is 0 for the inner UInt8 values, then there is only a
> >>> single buffer) -- one node versus two
> >>>
> >>> Let me know if this does not make sense.
> >>>
> >>> To move this forward I propose to begin a Logical Types / IPC layout
> >>> document and begin to document the mapping between logical types and
> >>> their physical in-memory representation and layout on the wire.
> >>>
> >>> - Wes
> >>
> >>
> >
> >
> >
> > --
> > Julien
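
To make the "collapsed" String form discussed in this thread concrete, here is a
minimal sketch in plain Python (it is not the Arrow C++ or Java code) of a String
column carried as three buffers: validity bitmap, offsets, and data. The names
encode_strings and decode_strings are hypothetical, and the use of little-endian
32-bit offsets and a least-significant-bit-first bitmap are assumptions made for
illustration rather than details stated in the thread.

```python
import struct


def encode_strings(values):
    """Pack a list of str-or-None into (validity bitmap, offsets, data) buffers.

    Assumptions for this sketch: bitmap bits are numbered least-significant-bit
    first, and offsets are little-endian signed 32-bit integers (n + 1 entries).
    """
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # mark slot i as valid
            data += value.encode("utf-8")
        # A null slot repeats the previous offset, so it spans zero bytes.
        offsets.append(len(data))
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(bitmap), offsets_buf, bytes(data)


def decode_strings(bitmap, offsets_buf, data):
    """Rebuild the list of str-or-None from the three buffers."""
    count = len(offsets_buf) // 4 - 1
    offsets = struct.unpack("<%di" % (count + 1), offsets_buf)
    result = []
    for i in range(count):
        is_valid = (bitmap[i // 8] >> (i % 8)) & 1
        if is_valid:
            result.append(data[offsets[i]:offsets[i + 1]].decode("utf-8"))
        else:
            result.append(None)
    return result


if __name__ == "__main__":
    buffers = encode_strings(["foo", None, "bar"])
    print(len(buffers), decode_strings(*buffers))
```

Under the List<UInt8> view described above, the same offsets and data would be
accompanied by a fourth buffer, the (empty) validity bitmap of the inner UInt8
node, which is the "4 buffers versus 3" difference being debated.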