arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [DISCUSS][C++] Static versus variable Arrow dictionary encoding
Date Wed, 01 May 2019 04:08:14 GMT
hi Hatem,

Thanks for commenting.

I am not sure your solution will work reliably because code is written
against arrow::DictionaryType with the presumption that the dictionary
is known and static, and can be obtained by invoking
DictionaryType::dictionary. In the variable dictionary case, the
dictionary is a property of an instance of an Array, instead, so it is
only possible to know the dictionary if you have a realization of the
data. You say this would "allow carrying around the currently known
dictionary entries for "variable dictionaries" using the existing
accessor" -- there are two cases to consider:

* What happens when the dictionary gets bigger
* What happens when the dictionary entries change order when observing
the next piece of data, but the schema stays the same

In such cases, what is the API for obtaining the dictionary? If the
dictionary has changed then the prior state of the [immutable]
DataType instance is no longer valid.
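
To make the distinction concrete, here is a rough C++ sketch (the
variable-dictionary behavior described in the second half is
hypothetical, per the proposal below):

    #include <arrow/api.h>

    // Sketch: where the dictionary lives in each case. Assumes `field`
    // is dictionary-typed and batch->column(0) is dictionary-encoded.
    void Sketch(const std::shared_ptr<arrow::Field>& field,
                const std::shared_ptr<arrow::RecordBatch>& batch) {
      // Static case: the dictionary hangs off the *type*, so it is
      // known even before any data has been seen
      auto dict_type =
          std::static_pointer_cast<arrow::DictionaryType>(field->type());
      std::shared_ptr<arrow::Array> dict = dict_type->dictionary();

      // Variable case (hypothetical): the type knows only the index
      // and value types; the dictionary must be read off an Array
      auto arr =
          std::static_pointer_cast<arrow::DictionaryArray>(batch->column(0));
      std::shared_ptr<arrow::Array> current = arr->dictionary();
      // If the next batch grows or reorders the dictionary, `current`
      // is stale and the immutable DataType cannot reflect the change
    }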

To give another motivating example for variable dictionaries --
Antoine has developed a multi-threaded CSV reader [1]. The current
requirement of static dictionaries makes parallel conversions that
need to ultimately yield dictionary-encoded string columns both
awkward and computationally costly. If the CSV conversion code is
allowed to yield a different dictionary for each converted file chunk
then things are much simpler.
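
Concretely, the per-chunk conversion might look like this (the encode
helper is hypothetical):

    // Each task encodes its chunk against a chunk-local dictionary;
    // no cross-chunk reconciliation pass is required afterwards
    std::vector<std::shared_ptr<arrow::Array>> encoded(chunks.size());
    for (size_t i = 0; i < chunks.size(); ++i) {
      // in practice spawned as one parallel task per chunk
      encoded[i] = DictEncodeChunk(chunks[i]);  // hypothetical helper
    }
    // Legal only if each DictionaryArray may carry its own dictionary
    auto column = std::make_shared<arrow::ChunkedArray>(encoded);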

- Wes

[1]: https://github.com/apache/arrow/tree/master/cpp/src/arrow/csv

On Tue, Apr 30, 2019 at 12:05 PM Hatem Helal <hhelal@mathworks.com> wrote:
>
> Hi Wes,
>
> Thanks for the detailed writeup and I think this is an important
> problem to solve. I spent some time thinking about this when working on
> ARROW-3769 and came to a similar conclusion that the current dictionary
> type was limiting when doing partial reads of Parquet files.
>
> I'm not sure if this makes sense, but I wondered if you had considered
> expanding the DictionaryType to incorporate the variable or static
> states? I think modelling these different states using the existing
> data type might be simpler than introducing another data type. The
> implementation could be a simple flag "is_static" passed as part of the
> constructor, and this would allow carrying around the currently known
> dictionary entries for "variable dictionaries" using the existing
> accessor. I'd imagine that variable dictionaries could compare equal so
> long as the data types of the index and dictionary are equal.
>
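> A minimal sketch of what I mean (the "is_static" flag and its plumbing
> are hypothetical):
>
>     class DictionaryType : public FixedWidthType {
>      public:
>       DictionaryType(const std::shared_ptr<DataType>& index_type,
>                      const std::shared_ptr<Array>& dictionary,
>                      bool ordered = false,
>                      bool is_static = true);  // hypothetical flag
>
>       // For a variable dictionary, this would return the currently
>       // known entries rather than a fixed, final set
>       std::shared_ptr<Array> dictionary() const;
>
>       bool is_static() const { return is_static_; }
>
>      private:
>       bool is_static_;
>     };
>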
> There are obvious backwards compatibility implications of such a
> change, but I don't know if Arrow even makes such a guarantee. This
> might be a dangerous change to make if clients have already written
> code that assumes that DictionaryTypes are always static, but I see
> some simplicity in extending the meaning of the existing data type.
>
> I'd be happy to take a stab at a draft PR for this if the idea sounds
> promising and/or needs more fleshing out before anyone can form an
> opinion.
>
> Thanks,
>
> Hatem
>
>
> On 4/29/19, 7:10 PM, "Wes McKinney" <wesmckinn@gmail.com> wrote:
>
>     hi all,
>
>     There have been many discussions in passing on various issues and JIRA
>     tickets over the last months and years about how to manage
>     dictionary-encoded columnar arrays in-memory in C++. Here's a list of
>     some problems we have encountered:
>
>     * Dictionaries that may differ from one record batch to another but
>     semantically represent parts of the same dataset. For example, each
>     row group in the Parquet format may have different dictionaries, and
>     decoding from encoded to "dense" representation may be undesirable
>
>     * Support for dictionary "deltas" in the IPC protocol, using the
>     isDelta flag on DictionaryBatch [1]. It's conceivable we might want to
>     allow dictionaries to change altogether in the IPC protocol (where a
>     dictionary id appears again -- a replacement -- but isDelta = false);
>     both cases are illustrated after this list
>
>     * Receiving a record batch schema without the dictionaries attached
>     (e.g. in Arrow Flight), see also experimental patch [2]
>
>     * Encoded datasets may be produced in a distributed system where
>     "reconciling" the dictionary is not computationally feasible or
>     desirable (particularly if the encoded data is just going to be
>     aggregated)
>
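>     To illustrate the delta and replacement cases from the second bullet
>     above, conceptually (for a dictionary with id = 1):
>
>     DictionaryBatch(id=1)                -> dictionary is ["a", "b"]
>     DictionaryBatch(id=1, isDelta=true)  -> appends ["c"]; now ["a", "b", "c"]
>     DictionaryBatch(id=1, isDelta=false) -> replaces; now e.g. ["x", "y"]
>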
>     The reason that these are "problems" has to do with the way that we
>     represent dictionary encoded data in C++. We have created a
>     "synthetic" DictionaryType object that is used for Array/Column types
>     or for an entry in arrow::Schema. The DictionaryType wraps the index
>     type (signed integer type) and the dictionary itself. So from Python
>     we have
>
>     >>> t = pa.dictionary(pa.int8(), pa.array(['a', 'b', 'c', 'd']))
>     >>> t
>     DictionaryType(dictionary<values=string, indices=int8, ordered=0>)
>     >>> t.dictionary
>     <pyarrow.lib.StringArray object at 0x7eff4847fd68>
>     [
>       "a",
>       "b",
>       "c",
>       "d"
>     ]
>     >>> t.index_type
>     DataType(int8)
>
>     This is useful in languages like Python because we have the notion of
>     Categorical types which are a combination of a static dictionary and a
>     contiguous array of indices.
>
>     It creates problems when we have changing dictionaries, because the
>     "schema" under this in-memory construction may change from record
>     batch to record batch. This means that types are deemed "unequal"
>     according to many code paths we've written.
>
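>     For example, in C++ (a sketch; Status returns and error handling
>     elided):
>
>     arrow::StringBuilder b1, b2;
>     b1.AppendValues({"a", "b"});
>     b2.AppendValues({"a", "b", "c"});
>     std::shared_ptr<arrow::Array> dict1, dict2;
>     b1.Finish(&dict1);
>     b2.Finish(&dict2);
>     auto t1 = arrow::dictionary(arrow::int8(), dict1);
>     auto t2 = arrow::dictionary(arrow::int8(), dict2);
>     // Same logical column, but the dictionary values participate in
>     // type equality, so the types (and any schemas containing them)
>     // compare unequal
>     bool same = t1->Equals(*t2);  // false
>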
>     To consider solutions to this problem, I want to first point out that
>     the way we are dealing with dictionary-encoded data in memory is a
>     purely semantic construct for C++ and the binding languages.
>     "Dictionary" is not a data type as all according to the Arrow IPC
>     protocol -- it is a method is transferring encoded / compressed data,
>     and the handling thereof is left to the implementations. There are
>     benefits to the method we are using now, in particular it makes
>     dynamic dispatch (including the visitor pattern, and virtual
>     functions) based on whether something is encoded or not simple. It
>     also leads to simple round trips of Categorical types from libraries
>     like pandas.
>
>     Here is my proposal for reconciling these issues in C++ (sketched in
>     code below):
>
>     * Add a new "synthetic" data type called "variable dictionary" to be
>     used alongside the existing "static dictionary" type. An instance of
>     VariableDictionaryType (name TBD) will not know what the dictionary
>     is, only the data type of the dictionary (e.g. utf8()) and the index
>     type (e.g. int32())
>
>     * Define common abstract API for instances of static vs variable
>     dictionary arrays. Mainly this means making
>     DictionaryArray::dictionary [3] virtual
>
>     * The _actual_ dictionary values for a particular Array must be stored
>     somewhere and lifetime managed. I propose to put these as a single
>     entry in ArrayData::child_data [4]. An alternative to this would be to
>     modify ArrayData to have a dictionary field that would be unused
>     except for encoded datasets
>
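>     Sketching the above in C++ (VariableDictionaryType is hypothetical
>     and all details are TBD):
>
>     class VariableDictionaryType : public DataType {
>      public:
>       // Knows only the logical shape of the encoding, not the values
>       std::shared_ptr<DataType> index_type() const;
>       std::shared_ptr<DataType> value_type() const;
>     };
>
>     class DictionaryArray : public Array {
>      public:
>       // Virtual so that static and variable arrays share one API; in
>       // the variable case this would return the dictionary stored as
>       // the single entry in ArrayData::child_data
>       virtual std::shared_ptr<Array> dictionary() const;
>     };
>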
>     This proposal does create some ongoing implementation and maintenance
>     burden, but to that I would make these points:
>
>     * Many algorithms will dispatch from one type to the other (probably
>     the static path dispatching to the variable path), so in most cases
>     there will not be a need to implement things multiple times
>
>     * In some algorithms, we may observe a stream of dictionary-encoded
>     arrays, and we need only obtain the current dictionary as well as the
>     knowledge of whether it is the same as previous dictionaries. In hash
>     aggregations and other analytics I think we need to implement by
>     default under the assumption of dynamic/variable dictionaries, as in
>     the sketch below
>
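>     For example, a hash aggregation consuming such a stream might do
>     something like this (sketch only; assumes a sequence `batches` of
>     record batches and the virtual accessor proposed above):
>
>     std::shared_ptr<arrow::Array> prev_dict;
>     for (const auto& batch : batches) {
>       auto arr = std::static_pointer_cast<arrow::DictionaryArray>(
>           batch->column(0));
>       std::shared_ptr<arrow::Array> dict = arr->dictionary();
>       if (prev_dict == nullptr || !dict->Equals(*prev_dict)) {
>         // Dictionary changed: remap or rebuild the internal hash
>         // table before consuming arr->indices()
>       }
>       prev_dict = dict;
>     }
>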
>     I haven't conceived (after much contemplation) of any other ideas for
>     how to algebraically accommodate these use cases in our object model,
>     so I am interested in the opinions of others. As a first use case for
>     this I would personally be looking to address reads of encoded data
>     from the Parquet format without an intermediate pass through the
>     dense representation (which can be slow and wasteful for heavily
>     compressed string data)
>
>     Thanks,
>     Wes
>
>     [1]: https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches
>     [2]: https://github.com/apache/arrow/pull/4067
>     [3]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L968
>     [4]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L208
>
>
