Date: Wed, 17 Aug 2016 17:52:21 +0000 (UTC)
From: "Wes McKinney (JIRA)"
To: dev@arrow.apache.org
Subject: [jira] [Commented] (ARROW-81) C++: Add a Category nested type

[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425026#comment-15425026 ]

Wes McKinney commented on ARROW-81:
-----------------------------------

There is no doubt that a Category logical type / metadata is necessary for many use cases (because it is semantically
distinct from dictionary-encoded data, even though the physical representation is the same). For example, statistics and machine learning users from many communities would not be able to faithfully round-trip data to Arrow metadata without it. If you would like to hear other perspectives on this, I will ask others to weigh in.

The implementation (physical representation) of Category is the open question. I would propose that it be a dictionary-encoded struct with a single child. For example:

{{Category[string] -> Struct}}

The additional metadata requirement is orderedness. This needs to be stored in the schema because it needs to be part of schema negotiation (rather than only observed in the realization of the data in the dictionary).

By using dictionary encoding for the implementation, one can also easily share a dictionary among multiple fields (those having the same category/factor levels).

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of the array.
> Typically there is an "ordered" boolean flag indicating whether the order of the categories is meaningful.
>
> Category/factor types are used in a number of common statistical analyses. See, for example, http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a basic requirement for Python and R, at least, as Arrow C++ consumers, to have this type. Separately, we should consider what is necessary to be able to transmit category data in IPCs -- possibly an expansion of the Arrow format.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
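
For illustration only, the layout described in the issue (an array of integer codes, a child array of levels, and an "ordered" flag kept alongside the type rather than inferred from the data) might be sketched in plain Python as follows. This is not Arrow's API: the `Category` class and its `from_values`, `decode`, and `less_than` methods are hypothetical names introduced for this example, and representing nulls as `None` codes is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Category:
    """Hypothetical stand-in for a Category array (not an Arrow class)."""

    codes: List[Optional[int]]  # one integer code per value; None marks a null
    levels: List[str]           # the dictionary ("categories" / "levels")
    ordered: bool = False       # metadata: is the order of the levels meaningful?

    @classmethod
    def from_values(cls, values: Sequence[Optional[str]],
                    levels: List[str], ordered: bool = False) -> "Category":
        # Dictionary-encode: replace each value by its position in `levels`.
        index = {level: i for i, level in enumerate(levels)}
        codes = [None if v is None else index[v] for v in values]
        # `levels` is stored by reference so several fields can share it.
        return cls(codes, levels, ordered)

    def decode(self) -> List[Optional[str]]:
        # Reverse the encoding by looking each code up in the dictionary.
        return [None if c is None else self.levels[c] for c in self.codes]

    def less_than(self, i: int, j: int) -> bool:
        # Comparing codes is only meaningful when the type is ordered.
        if not self.ordered:
            raise TypeError("comparison requires an ordered Category")
        a, b = self.codes[i], self.codes[j]
        if a is None or b is None:
            raise ValueError("cannot compare null values")
        return a < b


# Two fields sharing one dictionary of levels.
shared_levels = ["low", "medium", "high"]
severity = Category.from_values(["high", "low", None, "medium"],
                                shared_levels, ordered=True)
priority = Category.from_values(["medium", "medium"], shared_levels, ordered=True)
```

Here `severity` and `priority` refer to the same `shared_levels` list, which mirrors the point about sharing a dictionary between fields with identical levels. Keeping `ordered` as a field of the type, rather than deriving it from the dictionary contents, reflects the argument that orderedness belongs in the schema.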