arrow-dev mailing list archives

From Antoine Pitrou
Subject Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding
Date Mon, 29 Apr 2019 18:53:26 GMT

Hi Wes,

On 29/04/2019 at 20:10, Wes McKinney wrote:
> * Receiving a record batch schema without the dictionaries attached
> (e.g. in Arrow Flight), see also experimental patch [2]

Note that this was eventually done in a separate PR, and it only
required changes in the IPC implementation.

> Here is my proposal to reconcile these issues in C++
> * Add a new "synthetic" data type called "variable dictionary" to be
> used alongside the existing "static dictionary" type. An instance of
> VariableDictionaryType (name TBD) will not know what the dictionary
> is, only the data type of the dictionary (e.g. utf8()) and the index
> type (e.g. int32())

Interesting idea.  I'm curious to see a PR.
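
For illustration only, here is a minimal sketch of how the two kinds of
type might differ. It uses simplified stand-in structs, not the real
Arrow classes: `DataType`, `StaticDictionaryType`, `ToVariable` and
`SameType` are hypothetical names, and `VariableDictionaryType` is the
"name TBD" from the quoted proposal. The point it shows is that two
arrays with different dictionaries could share one variable type.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a data type; NOT the real arrow::DataType.
struct DataType {
  std::string name;
};
using TypePtr = std::shared_ptr<DataType>;

// "Static" dictionary type: the dictionary values are part of the type.
struct StaticDictionaryType {
  TypePtr index_type;
  TypePtr value_type;
  std::vector<std::string> dictionary;
};

// "Variable" dictionary type: only the index type and the value type
// are known; the dictionary itself lives elsewhere.
struct VariableDictionaryType {
  TypePtr index_type;
  TypePtr value_type;
};

// Dropping the dictionary turns a static type into a variable one.
inline VariableDictionaryType ToVariable(const StaticDictionaryType& t) {
  return VariableDictionaryType{t.index_type, t.value_type};
}

// Two variable types compare equal when their index and value types
// match, whatever dictionaries the arrays happen to carry.
inline bool SameType(const VariableDictionaryType& a,
                     const VariableDictionaryType& b) {
  assert(a.index_type && a.value_type && b.index_type && b.value_type);
  return a.index_type->name == b.index_type->name &&
         a.value_type->name == b.value_type->name;
}
```

Under these assumptions, record batches whose dictionaries differ would
still agree on a single schema-level type.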

> * Define common abstract API for instances of static vs variable
> dictionary arrays. Mainly this means making
> DictionaryArray::dictionary [3] virtual

I'm not sure this is required, especially if the following is implemented:

> * The _actual_ dictionary values for a particular Array must be stored
> somewhere and lifetime managed. I propose to put these as a single
> entry in ArrayData::child_data [4]. An alternative to this would be to
> modify ArrayData to have a dictionary field that would be unused
> except for encoded datasets

`child_data` is supposed to mirror more or less the order of buffers in
an IPC stream, right?  Therefore I would favour a dedicated dictionary
field (which also makes fetching the dictionary trivial).
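
To make the two options concrete, here is a rough sketch with a heavily
simplified stand-in for ArrayData. `ArrayDataSketch`, its members, and
the helper functions are hypothetical and only loosely modelled on the
real struct; buffers are replaced by plain vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in; NOT the real arrow::ArrayData.
struct ArrayDataSketch {
  std::vector<int32_t> indices;     // stands in for the index buffer
  std::vector<std::string> values;  // set when this node *is* a dictionary
  std::vector<std::shared_ptr<ArrayDataSketch>> child_data;
  std::shared_ptr<ArrayDataSketch> dictionary;  // unused unless encoded
};

// Option A (quoted proposal): the dictionary is the single child_data
// entry, so callers must know that convention to find it.
inline std::shared_ptr<ArrayDataSketch> DictionaryFromChild(
    const ArrayDataSketch& data) {
  return data.child_data.size() == 1 ? data.child_data[0] : nullptr;
}

// Option B (dedicated field): fetching the dictionary is a member access
// and child_data keeps its usual meaning.
inline std::shared_ptr<ArrayDataSketch> DictionaryFromField(
    const ArrayDataSketch& data) {
  return data.dictionary;
}

// Decode the indices through the dedicated dictionary field.
inline std::vector<std::string> Decode(const ArrayDataSketch& data) {
  assert(data.dictionary != nullptr);
  std::vector<std::string> out;
  out.reserve(data.indices.size());
  for (int32_t i : data.indices) {
    out.push_back(data.dictionary->values.at(static_cast<std::size_t>(i)));
  }
  return out;
}
```

With the dedicated field, `child_data` stays empty for a dictionary
array in this sketch, which keeps it free to describe nested children
only.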

> This proposal does create some ongoing implementation and maintenance
> burden, but to that I would make these points:
> * Many algorithms will dispatch from one type to the other (probably
> static dispatching to the variable path), so there will not be a need
> to implement multiple times in most cases

Sounds believable indeed.
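
A small sketch of what such dispatching could look like, assuming the
kernel only needs the indices and the dictionary values. All names here
(`VariableDictArray`, `StaticDictType`, `StaticDictArray`,
`ValueCounts`) are hypothetical and not part of the Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins; NOT the real Arrow array classes.
struct VariableDictArray {
  std::vector<int32_t> indices;
  std::vector<std::string> dictionary;  // carried by the data
};
struct StaticDictType {
  std::vector<std::string> dictionary;  // carried by the type
};
struct StaticDictArray {
  StaticDictType type;
  std::vector<int32_t> indices;
};

// The single real implementation: count how often each dictionary
// entry is referenced by the indices.
inline std::vector<int64_t> ValueCounts(
    const std::vector<int32_t>& indices,
    const std::vector<std::string>& dictionary) {
  std::vector<int64_t> counts(dictionary.size(), 0);
  for (int32_t i : indices) {
    assert(i >= 0 && static_cast<std::size_t>(i) < dictionary.size());
    counts[static_cast<std::size_t>(i)] += 1;
  }
  return counts;
}

// Variable path: the dictionary comes from the array itself.
inline std::vector<int64_t> ValueCounts(const VariableDictArray& a) {
  return ValueCounts(a.indices, a.dictionary);
}

// Static path: fetch the dictionary from the type, then reuse the same
// implementation, so nothing is written twice.
inline std::vector<int64_t> ValueCounts(const StaticDictArray& a) {
  return ValueCounts(a.indices, a.type.dictionary);
}
```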


