arrow-dev mailing list archives

From "Wes McKinney (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ARROW-81) C++: Add a Category nested type
Date Wed, 17 Aug 2016 17:52:21 GMT

    [ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425026#comment-15425026 ]

Wes McKinney commented on ARROW-81:
-----------------------------------

A Category logical type / metadata is clearly necessary for many use cases, because it is
semantically distinct from dictionary-encoded data even though the physical representation
is the same. For example, statistics and machine learning users from many communities would
not be able to faithfully round-trip their data through Arrow metadata without it. If you
would like other perspectives on this, I can ask others to weigh in.


The implementation (physical representation) of Category is the open question. I propose
that it be a dictionary-encoded struct with a single child. For example:

{{Category[string] -> Struct<levels: String>}}

The additional metadata requirement is orderedness. This needs to be stored in the schema
as it needs to be a part of schema negotiation (rather than only observed in the realization
of the data in the dictionary). 
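
To illustrate why orderedness belongs in the schema, here is a minimal Python sketch. The names `CategoryType`, `Field`, and `schemas_compatible` are hypothetical and are not part of any Arrow API; the sketch only assumes that schema negotiation compares field types without looking at dictionary data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CategoryType:
    """Hypothetical schema-level description of a Category field."""
    value_type: str  # type of the levels, e.g. "string"
    ordered: bool    # whether the order of the levels is meaningful


@dataclass(frozen=True)
class Field:
    """Hypothetical named field carrying a CategoryType."""
    name: str
    type: CategoryType


def schemas_compatible(a: Sequence[Field], b: Sequence[Field]) -> bool:
    # Compare schemas field by field; no dictionary data is consulted.
    return tuple(a) == tuple(b)


unordered_schema = [Field("size", CategoryType("string", ordered=False))]
ordered_schema = [Field("size", CategoryType("string", ordered=True))]
```

Under these assumptions the two schemas differ only in the `ordered` flag, and the mismatch is detectable before any dictionary is exchanged.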

By using dictionary encoding for the implementation, one can also easily share dictionaries
used by multiple fields (having the same category/factor levels). 
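
As a rough sketch of dictionary sharing (plain Python, not the Arrow C++ API; `CategoryArray` and `encode` are hypothetical names introduced here), two fields can hold their own integer codes while referencing one list of levels:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CategoryArray:
    """Hypothetical in-memory Category: integer codes plus a dictionary of levels."""
    codes: List[int]
    levels: List[str]  # the dictionary; may be shared between fields
    ordered: bool = False

    def decode(self) -> List[str]:
        # Map each code back to its level.
        return [self.levels[code] for code in self.codes]


def encode(values: List[str], levels: List[str], ordered: bool = False) -> CategoryArray:
    # Build a level -> code lookup, then translate each value to its code.
    index = {level: i for i, level in enumerate(levels)}
    return CategoryArray([index[v] for v in values], levels, ordered)


# One dictionary, referenced (not copied) by two fields.
shared_levels = ["low", "medium", "high"]
before = encode(["low", "high", "high"], shared_levels, ordered=True)
after = encode(["medium", "medium", "low"], shared_levels, ordered=True)
```

Here `before` and `after` point at the same `shared_levels` object, so codes from either field can be compared directly without reconciling two dictionaries.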

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of the array.
> Typically there is an "ordered" boolean flag indicating whether the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. See, for example, http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a basic requirement for Python and R, at least, as Arrow C++ consumers, to have this type. Separately, we should consider what is necessary to be able to transmit category data in IPCs -- possibly an expansion of the Arrow format.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
