From: wesm@apache.org
To: commits@arrow.apache.org
Reply-To: dev@arrow.apache.org
Message-Id: <556fd2bbab4442c3ad1f56e086d8c4b3@git.apache.org>
Subject: arrow git commit: ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases
Date: Mon, 23 Jan 2017 14:13:45 +0000 (UTC)

Repository: arrow
Updated Branches:
  refs/heads/master 282103012 -> 085c8754b

ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases

cc @julienledem @nongli @jacques-n.
I am hoping to close the loop on our discussion in https://issues.apache.org/jira/browse/ARROW-81. In my applications, I need the flexibility to transmit:

* Dictionaries encoded in signed integers smaller than int32. For example, with 10 dictionary values, we may send int8 indices
* Indicator that the dictionary is ordered

These features are needed for Python and R support, and in general for statistical computing applications.

Author: Wes McKinney

Closes #297 from wesm/ARROW-81 and squashes the following commits:

c960bac [Wes McKinney] Augment dictionary encoding metadata to accommodate additional use cases

Project: http://git-wip-us.apache.org/repos/asf/arrow/repo
Commit: http://git-wip-us.apache.org/repos/asf/arrow/commit/085c8754
Tree: http://git-wip-us.apache.org/repos/asf/arrow/tree/085c8754
Diff: http://git-wip-us.apache.org/repos/asf/arrow/diff/085c8754

Branch: refs/heads/master
Commit: 085c8754b0ab2da7fcd245fc88bc4de9a6806a4c
Parents: 2821030
Author: Wes McKinney
Authored: Mon Jan 23 09:13:39 2017 -0500
Committer: Wes McKinney
Committed: Mon Jan 23 09:13:39 2017 -0500

----------------------------------------------------------------------
 format/Message.fbs | 27 ++++++++++++++++++++++++---
 1 file changed, 24 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/arrow/blob/085c8754/format/Message.fbs
----------------------------------------------------------------------
diff --git a/format/Message.fbs b/format/Message.fbs
index b2c6464..028c56a 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -151,6 +151,26 @@ table KeyValue {
 }
 
 /// ----------------------------------------------------------------------
+/// Dictionary encoding metadata
+
+table DictionaryEncoding {
+  /// The known dictionary id in the application where this data is used. In
+  /// the file or streaming formats, the dictionary ids are found in the
+  /// DictionaryBatch messages
+  id: long;
+
+  /// The dictionary indices are constrained to be positive integers. If this
+  /// field is null, the indices must be signed int32
+  indexType: Int;
+
+  /// By default, dictionaries are not ordered, or the order does not have
+  /// semantic meaning. In some statistical, applications, dictionary-encoding
+  /// is used to represent ordered categorical data, and we provide a way to
+  /// preserve that metadata here
+  isOrdered: bool;
+}
+
+/// ----------------------------------------------------------------------
 /// A field represents a named column in a record / row batch or child of a
 /// nested type.
 ///
@@ -163,9 +183,10 @@ table Field {
   name: string;
   nullable: bool;
   type: Type;
-  // present only if the field is dictionary encoded
-  // will point to a dictionary provided by a DictionaryBatch message
-  dictionary: long;
+
+  // Present only if the field is dictionary encoded
+  dictionary: DictionaryEncoding;
+
   // children apply only to Nested data types like Struct, List and Union
   children: [Field];
   /// layout of buffers produced for this type (as derived from the Type)
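
The two use cases named in the commit message (indices narrower than int32, and an ordered flag) can be illustrated with a small Python sketch. This is not part of the commit and is not Arrow's implementation. The IndexType and DictionaryEncoding classes, the smallest_index_type and dictionary_encode helpers, and the modelling of the index type as a bit width plus signedness are all hypothetical names and assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Hashable, List, Optional, Sequence, Tuple

# Signed integer widths (in bits) considered for dictionary indices.
SIGNED_BIT_WIDTHS = (8, 16, 32, 64)


@dataclass
class IndexType:
    """Hypothetical stand-in for the integer type named by indexType."""
    bit_width: int
    is_signed: bool = True


@dataclass
class DictionaryEncoding:
    """Hypothetical Python mirror of the DictionaryEncoding table."""
    id: int
    index_type: Optional[IndexType] = None  # None is read as signed int32
    is_ordered: bool = False


def smallest_index_type(num_values: int) -> IndexType:
    """Pick the narrowest signed integer that can index num_values entries."""
    for bits in SIGNED_BIT_WIDTHS:
        max_index = 2 ** (bits - 1) - 1  # largest value of a signed integer
        if num_values - 1 <= max_index:
            return IndexType(bit_width=bits)
    raise ValueError("dictionary too large for a 64-bit signed index")


def dictionary_encode(
    values: Sequence[Hashable], dictionary_id: int, ordered: bool = False
) -> Tuple[List[Hashable], List[int], DictionaryEncoding]:
    """Encode values as (dictionary, indices, metadata).

    The dictionary keeps values in order of first appearance.
    """
    dictionary: List[Hashable] = []
    positions = {}
    indices: List[int] = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    metadata = DictionaryEncoding(
        id=dictionary_id,
        index_type=smallest_index_type(len(dictionary)),
        is_ordered=ordered,
    )
    return dictionary, indices, metadata


if __name__ == "__main__":
    levels, codes, meta = dictionary_encode(
        ["low", "high", "low", "medium"], dictionary_id=0, ordered=True
    )
    print(levels, codes, meta)
```

With 10 distinct values, as in the commit message's example, smallest_index_type returns an 8-bit signed type. In this sketch a producer that leaves index_type unset falls back to the signed int32 default described in the schema comment.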