arrow-dev mailing list archives

From "Micah Kornfield (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ARROW-81) [Format] Add a Category logical type (distinct from dictionary-encoding)
Date Fri, 19 Aug 2016 23:34:20 GMT

    [ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429030#comment-15429030 ]

Micah Kornfield edited comment on ARROW-81 at 8/19/16 11:33 PM:
----------------------------------------------------------------

Yeah, sorry, I should be more consistent with terminology.  In this case I meant a logical
type (i.e. a new specific type in Message.fbs).  For the physical layout, is the desire to represent
this as a Struct to allow extensibility with additional data in the enumeration (per Julien's
suggestion), or is it for some other reason?  It seems easier for systems not aware of categorical
types to interact with the type if we don't add the extra level of nesting.

I was imagining something more in the order of the below for a categorical string (like R).
The general idea is you can have a categorical variable of an arbitrary type, but don't add
the extra nesting in the structure proposed above.  A column's categoricalness becomes an
extra piece of metadata on the field.

{code}
String: dictionary-encoded
dictionary indices: [0, 0, 0, 0, 1, 1, 1, 1]
dictionary_id: i
// new member
Categorical_type: Ordered|Unordered|None

dictionary i: 
type=[String]
values= ['foo', 'bar']
{code}

The new member above could either be a specific, explicitly modeled flatbuffer element, or
we can create a general extension for key-value pairs on fields. Another example of where
a key-value pair field might be useful is to pass along metadata about list types, indicating
that they are sorted and/or contain unique values.
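
To make the two options concrete, here is a minimal sketch in plain Python. It is not Arrow's API and not the contents of Message.fbs; every name in it (CategoricalType, Dictionary, DictionaryEncodedField, decode, the "categorical_type" metadata key) is hypothetical and chosen only to mirror the layout above. The point it tries to show is that the column stays a flat dictionary-encoded column either way, so a reader that knows nothing about categoricals can still decode it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class CategoricalType(Enum):
    """Hypothetical enum mirroring `Categorical_type: Ordered|Unordered|None`."""
    ORDERED = "Ordered"
    UNORDERED = "Unordered"
    NONE = "None"


@dataclass
class Dictionary:
    """A dictionary: an id plus the distinct values it holds."""
    dictionary_id: int
    value_type: str
    values: List[str]


@dataclass
class DictionaryEncodedField:
    """A flat dictionary-encoded column, with no extra Struct nesting."""
    name: str
    dictionary_id: int
    indices: List[int]
    # Option 1 (assumed shape): an explicitly modeled member on the field.
    categorical_type: CategoricalType = CategoricalType.NONE
    # Option 2 (assumed shape): generic key-value metadata on the field.
    metadata: Dict[str, str] = field(default_factory=dict)


def decode(column: DictionaryEncodedField,
           dictionaries: Dict[int, Dictionary]) -> List[str]:
    """Resolve indices to values; this ignores any categorical information."""
    dictionary = dictionaries[column.dictionary_id]
    return [dictionary.values[i] for i in column.indices]


dictionaries = {
    0: Dictionary(dictionary_id=0, value_type="String", values=["foo", "bar"]),
}

# Option 1: categoricalness carried by the explicit member.
explicit = DictionaryEncodedField(
    name="col", dictionary_id=0, indices=[0, 0, 0, 0, 1, 1, 1, 1],
    categorical_type=CategoricalType.UNORDERED,
)

# Option 2: categoricalness carried by a key-value pair.
generic = DictionaryEncodedField(
    name="col", dictionary_id=0, indices=[0, 0, 0, 0, 1, 1, 1, 1],
    metadata={"categorical_type": "Unordered"},
)
```

In this sketch decode() returns the same values for both columns, since neither option changes the physical layout; only the place where the flag lives differs.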


was (Author: emkornfield@gmail.com):
Yeah, sorry, I should be more consistent with terminology.  In this case I meant a logical
type (i.e. a new specific type in Message.fbs).  For the physical layout, is the desire to represent
this as a Struct to allow extensibility with additional data in the enumeration (per Julien's
suggestion), or is it for some other reason?  It seems easier for systems not aware of categorical
types to interact with the type if we don't add the extra level of nesting.

I was imagining something more in the order of the below for a categorical string (like R).
The general idea is you can have a categorical variable of an arbitrary type, but don't add
the extra nesting in the structure proposed above.  A column's categoricalness becomes an
extra piece of metadata on the field.

{code}
String: dictionary-encoded
dictionary indices: [0, 0, 0, 0, 1, 1, 1, 1]
dictionary_id: i
// new member
Categorical_type: Ordered|Unordered|None

dictionary i: 
type=[String]
{code}

The new member above could either be a specific, explicitly modeled flatbuffer element, or
we can create a general extension for key-value pairs on fields. Another example of where
a key-value pair field might be useful is to pass along metadata about list types, indicating
that they are sorted and/or contain unique values.

> [Format] Add a Category logical type (distinct from dictionary-encoding)
> ------------------------------------------------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic
> meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of the array.
> Typically there is an "ordered" boolean flag indicating whether the order of the categories
> is meaningful.
> Category/factor types are used in a number of common statistical analyses. See, for example,
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a basic requirement
> for Python and R, at least, as Arrow C++ consumers, to have this type. Separately, we should
> consider what is necessary to be able to transmit category data in IPCs -- possibly an expansion
> of the Arrow format.
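
The description above (integer codes, a child array of categories, and an "ordered" flag) can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not an Arrow C++ or pandas interface: the Category class, its sorted_values() method, and the sample data are all hypothetical. The fallback to a lexical sort for unordered categories is likewise an illustrative choice, used here only to show what the flag changes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Category:
    """Hypothetical container: integer codes plus the categories they index."""
    codes: List[int]
    categories: List[str]
    ordered: bool = False

    def values(self) -> List[str]:
        """Decode every code to its category label."""
        return [self.categories[code] for code in self.codes]

    def sorted_values(self) -> List[str]:
        """Sort by category position when ordered, else by label text."""
        if self.ordered:
            return [self.categories[code] for code in sorted(self.codes)]
        return sorted(self.values())


# Sample data (made up): the position of each level carries meaning.
sizes = Category(
    codes=[2, 0, 1, 0],
    categories=["low", "medium", "high"],
    ordered=True,
)
```

With ordered=True, sorting follows the order of the categories ("low" before "medium" before "high"); with ordered=False the same data would sort by label text, which puts "high" first.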



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
