arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jh...@apache.org>
Subject Re: Adding a Map logical type to the Arrow metadata
Date Wed, 19 Jul 2017 17:37:07 GMT
List<Struct<K, V>> isn’t the only physical representation that makes sense. Because
it doesn’t take advantage of the fact that (a) keys can be re-ordered, (b) keys are unique.

So, another viable physical representation would be Struct<List<K>, List<V>>,
with the keys sorted. If keys are constant width and in contiguous memory then binary search
is very fast.

I am not claiming that this physical representation is better than yours. But the fact that
there is a more than one means it’s not a no-brainer.

Julian
 

> On Jul 18, 2017, at 12:10 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
> 
> I recently created https://issues.apache.org/jira/browse/ARROW-1207
> and wanted to discuss on the mailing list to hear opinions about how
> to proceed.
> 
> Some systems, like Spark [1], Presto [2], or Drill have a Map<K, V>
> composite type. These are sometimes stored in Parquet as a repeated
> struct, or in Arrow types List<item: Struct<key: K, value: V>>.
> 
> While we can represent in-memory map data as List<Struct<K, V>>, it
> may be useful to add a new logical type to the set of supported
> logical types [3]. The idea is that the memory format between Map<K,
> V> and List<Struct<K, V>> is identical, so this is strictly a logical
> construct, similar to date/time values having the same in-memory
> format as the corresponding integer types (int32/int64)
> 
> For Arrow implementation that do not provide a first class Map
> container, they could process the data as though it were a repeated
> struct. It would be helpful to us in C++ to have an arrow::MapArray
> container because we could convert to / from this type and other data
> structures like Python dictionaries. It would also be helpful to
> faithfully transport the MAP logical type from Parquet [4]
> 
> Let me know what others think. One question I have is whether the
> repeated struct in-memory representation makes sense as the canonical
> map representation.
> 
> Thanks
> Wes
> 
> [1]: https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/types/MapType.html
> [2]: https://prestodb.io/docs/current/functions/map.html
> [3]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L157
> [4]: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L63


Mime
View raw message