arrow-dev mailing list archives

From Brian Hulette <brian.hule...@ccri.com>
Subject Allow dictionary-encoded children?
Date Fri, 06 Apr 2018 14:42:35 GMT
I've been considering a use-case with a dictionary-encoded struct 
column, which may contain some dictionary-encoded columns itself. More 
specifically, in this use-case each row represents a single observation 
in a geospatial track, which includes a position, a time, and some 
track-level metadata (track id, origin, destination, etc.). I would 
like to represent the metadata as a dictionary-encoded struct, since 
unique values will be repeated for each observation of that track, and I 
would _also_ like to dictionary-encode some of the metadata column's 
children, since unique values will typically be repeated in multiple tracks.
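To make the layout concrete, here is a minimal plain-Python sketch of the two levels of encoding described above. It does not use the Arrow libraries; the `dictionary_encode` helper, the field order, and the sample values are all hypothetical and only illustrate the idea.

```python
def dictionary_encode(values):
    """Return (indices, dictionary), with unique values in first-seen order."""
    positions = {}   # value -> its index in the dictionary
    dictionary = []  # unique values
    indices = []     # one index per input value
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


# Hypothetical per-observation metadata: (track_id, origin, destination).
metadata = [
    ("t1", "BOS", "JFK"),
    ("t1", "BOS", "JFK"),
    ("t2", "JFK", "BOS"),
    ("t2", "JFK", "BOS"),
    ("t2", "JFK", "BOS"),
    ("t3", "BOS", "ORD"),
]

# Outer level: one index per observation, one dictionary entry per track.
track_indices, track_dictionary = dictionary_encode(metadata)

# Inner level: dictionary-encode one child (origin) of the struct dictionary,
# since the same origin can appear in several tracks.
origins = [row[1] for row in track_dictionary]
origin_indices, origin_dictionary = dictionary_encode(origins)
```

The outer indices repeat once per observation, while the inner indices repeat once per track, which is the nesting the question is about.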

I think one could make a (totally legitimate) argument that this is 
stretching a format designed for tabular data too far. This use-case 
could also be accomplished by breaking out the struct metadata column 
into its own Arrow table and managing a new integer column that 
references that table. This would look almost identical to what I 
initially described; it just wouldn't rely on the Arrow libraries to 
manage the "dictionary".
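A rough sketch of that alternative, again in plain Python rather than with the Arrow libraries: the two "tables" are dicts of columns, and the column names (`track_ref`, `track_id`, and so on), the values, and the `resolve` helper are all hypothetical.

```python
# Observation table: one row per observation, with an integer reference
# into the track table instead of a dictionary-encoded struct column.
observations = {
    "time": [0, 10, 0, 10],
    "track_ref": [0, 0, 1, 1],
}

# Track table: one row per track, holding the track-level metadata.
tracks = {
    "track_id": ["t1", "t2"],
    "origin": ["BOS", "JFK"],
    "destination": ["JFK", "BOS"],
}


def resolve(observations, tracks, column):
    """Look up a track-level column for every observation row."""
    return [tracks[column][ref] for ref in observations["track_ref"]]


origin_per_observation = resolve(observations, tracks, "origin")
```

The data is the same as in the dictionary-encoded form; the difference is that the application, not the library, has to keep `track_ref` consistent with the track table and perform the lookup.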
The spec doesn't have anything to say on this topic as far as I can 
tell, but our implementations don't currently allow a dictionary-encoded 
column's children to be dictionary-encoded themselves [1]. Is this just 
a simplifying assumption, or a hard rule that should be codified in the 
spec?

Thanks,
Brian

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/metadata-internal.cc#L824
