arrow-dev mailing list archives

From "Wes McKinney (JIRA)" <>
Subject [jira] [Commented] (ARROW-81) [Format] Add a Category logical type (distinct from dictionary-encoding)
Date Mon, 22 Aug 2016 05:04:20 GMT


Wes McKinney commented on ARROW-81:

You could have a case of redundancy if the span of dictionary indices and dictionary values
is the same. In practice, dictionary-encoding integers with a wide span (for example, if
max(values) - min(values) is some large number) can have significant performance benefits,
as you can do things that normally require a hash table (e.g. computing a frequency table)
with much less effort.
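
To illustrate the point about frequency tables, here is a minimal sketch in plain Python. It is not Arrow code: the helper names `dictionary_encode` and `frequency_table` are hypothetical and chosen only for this example. The hash table is needed once, during encoding. After that, counting is a direct array lookup on small dense codes, regardless of how far apart the original integers are.

```python
def dictionary_encode(values):
    """Return (codes, dictionary); dictionary keeps first-appearance order.

    Hypothetical helper for illustration; a hash table is used here, once.
    """
    index = {}
    codes = []
    for v in values:
        # Assign the next free code the first time a value is seen.
        codes.append(index.setdefault(v, len(index)))
    return codes, list(index)


def frequency_table(codes, dictionary):
    """Count occurrences using only array indexing on the dense codes."""
    counts = [0] * len(dictionary)
    for c in codes:
        counts[c] += 1
    return dict(zip(dictionary, counts))


# Integers with a very wide span: a dense count array indexed by the raw
# values would need billions of slots, but the codes only span 0..2.
values = [-1_000_000, 7, 7, 2_000_000_000, -1_000_000, 7]
codes, dictionary = dictionary_encode(values)
freq = frequency_table(codes, dictionary)
```

Once the data is already dictionary-encoded, repeated aggregations of this kind can reuse the codes and skip hashing entirely.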

> [Format] Add a Category logical type (distinct from dictionary-encoding)
> ------------------------------------------------------------------------
>                 Key: ARROW-81
>                 URL:
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
> A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic
meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of the array.
Typically there is an "ordered" boolean flag indicating whether the order of the categories
is meaningful.
> Category/factor types are used in a number of common statistical analyses. See, for example, It is a basic requirement
for Python and R, at least, as Arrow C++ consumers, to have this type. Separately, we should
consider what is necessary to be able to transmit category data in IPCs -- possibly an expansion
of the Arrow format. 
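
The structure described in the issue (integer codes, a child array of categories, and an "ordered" flag) can be sketched roughly as follows. This is an assumption-laden illustration in plain Python, not the Arrow C++ design: the class name `Category` and the methods `decode` and `less_than` are hypothetical, and null handling is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Category:
    """Hypothetical sketch of a category/factor array (not an Arrow API)."""

    codes: List[int]        # one integer code per element
    categories: List[str]   # the "levels"; code i refers to categories[i]
    ordered: bool = False   # whether the order of the categories is meaningful

    def __post_init__(self):
        # Every code must point at an existing category.
        n = len(self.categories)
        for c in self.codes:
            if not 0 <= c < n:
                raise ValueError(f"code {c} is outside 0..{n - 1}")

    def decode(self) -> List[str]:
        """Materialize the logical values by looking each code up."""
        return [self.categories[c] for c in self.codes]

    def less_than(self, i: int, j: int) -> bool:
        """Compare two elements by category order; only valid when ordered."""
        if not self.ordered:
            raise TypeError("categories are unordered; comparison is undefined")
        return self.codes[i] < self.codes[j]


sizes = Category(
    codes=[0, 2, 1, 0],
    categories=["small", "medium", "large"],
    ordered=True,
)
decoded = sizes.decode()
first_is_smaller = sizes.less_than(0, 1)  # "small" vs "large"
```

The sketch shows why the dictionary has semantic meaning here: the position of each entry in `categories` defines the comparison order, which a plain dictionary encoding used only for compression would not guarantee.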
