From: Julien Le Dem
Date: Mon, 15 Aug 2016 14:04:58 -0700
Subject: Re: Discussion: Should we make string/binary types first class Arrow Array types?
To: Micah Kornfield
Cc: Wes McKinney, dev@arrow.apache.org, Jacques Nadeau

There's ARROW-258, which is about clarifying the difference (if any) in
metadata across RPC (sockets), IPC (shared memory) and files.
The vector layout is the same except in RPC or files they get concatenated
together when copied over.
The metadata should be mostly the same (ideally the same). Buffer offsets
are relative to the beginning of the body in the context of RPC and to the
start of the file in files.
In the context of IPC it looks like we need an extra page id (from
Message.fbs).
Is this correct?

On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield wrote:

> Thanks Wes,
> This makes sense. +1 on the "Logical Types / IPC layout
> document" is there a JIRA open for this?
>
> I'll open a JIRA item to change the inheritance of string/binary in the
> C++ code base.
>
> Thanks,
> Micah
>
> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney wrote:
>
>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield wrote:
>> > Sorry for the late reply.
>> >
>> > This all sounds reasonable to me. But I'm not sure I understand exactly
>> > what you mean by
>> >
>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> >> would be a single array unit in the buffer stream and flattened Field
>> >> metadata rather than nested types (2 array units as they are
>> >> presently).
>> >
>> >
>> > The way I read it this seems to me to contradict the
>> > cross-implementation as "List"?
>> >
>> > Thanks,
>> > Micah
>> >
>>
>> I think we can resolve this by starting a "Logical Types and IPC/RPC
>> layout" specification document.
>>
>> The schema metadata
>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
>> as I understand it, strictly the domain of logical types.
>> I believe there is some minor conflation of the notions of primitive
>> physical types and primitive logical types.
>>
>> While String / Binary have identical physical layouts to List<UInt8
>> not null>, in the domain of logical types and IPC, what we are saying
>> is that these types are:
>>
>> - logically speaking: primitive, non-nested types
>> - their IPC layout is the flattened version of the nested List
>> counterpart -- a single Field node having String type (with a null
>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
>> data. Structurally on the wire / in shared memory (compared with
>> List) the only difference is the Field metadata (since
>> if null count is 0 for the inner UInt8 values, then there is only a
>> single buffer) -- one node versus two
>>
>> Let me know if this does not make sense.
>>
>> To move this forward I propose to begin a Logical Types / IPC layout
>> document and begin to document the mapping between logical types and
>> their physical in-memory representation and layout on the wire.
>>
>> - Wes
>>
>
>

--
Julien
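
For reference, a minimal Python sketch of the flattened String layout described in the quoted message: a single Field node plus three buffers (validity bitmap, offsets, data), which are then concatenated into one body with offsets relative to the start of that body. The names `encode_string_array` and `concat_body`, the dictionary used for the Field node, the little-endian int32 offsets, the least-significant-bit-first bitmap, and the absence of alignment padding are all assumptions made for illustration; none of them is taken from the thread or from Message.fbs.

```python
import struct


def encode_string_array(values):
    """Encode a list of str-or-None as one field node and three buffers.

    Hypothetical helper: returns (field_node, [validity, offsets, data]).
    """
    validity = bytearray((len(values) + 7) // 8)  # one bit per slot
    offsets = [0]                                 # length + 1 entries
    data = bytearray()                            # concatenated UTF-8 bytes
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            null_count += 1                       # bit stays 0, offset repeats
        else:
            validity[i // 8] |= 1 << (i % 8)      # mark slot i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))
    field_node = {
        "type": "String",
        "length": len(values),
        "null_count": null_count,
    }
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return field_node, [bytes(validity), offsets_buf, bytes(data)]


def concat_body(buffers):
    """Concatenate buffers; record each (offset, length) relative to the body start."""
    body = bytearray()
    layout = []
    for buf in buffers:
        layout.append((len(body), len(buf)))
        body += buf
    return bytes(body), layout


node, buffers = encode_string_array(["ab", None, "cde"])
body, layout = concat_body(buffers)
print(node)
print(layout)
```

In this sketch the string column is described by one node and three buffer entries. A nested List representation would instead need a second node for the inner UInt8 values, which is the "one node versus two" difference mentioned in the quoted message.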