arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Discussion: Should we make string/binary types first class Arrow Array types?
Date Fri, 15 Jul 2016 15:19:14 GMT
There are 3 distinct issues here:

1) Physical memory representation
2) Metadata
3) Implementation details

On these

1) I think no one will dispute that String/Binary have the same memory
representation as List<uint8 [not-null]>, and that, regardless of the
implementation, you can perform a zero-copy cast without copying
or duplicating buffers, only changing the array container metadata.

2) I'm +1 on String/Binary being logically first-class primitive
types, with the intent that they are not considered logically nested
types (but you can perform the cast described in #1 if you want to get
nested data without copying).

3) The C++ code sharing / duplication issue feels slightly orthogonal
to the above two items, which are about user semantics and metadata.
Effectively what would change is that
std::dynamic_pointer_cast<ListArray>(string_data) would no longer be
valid, as in the class hierarchy we would have


- Primitive
  - Integer
  - Floating
  - String
  - ...
- List
- Struct
- Union

rather than the present

- List
  - String (with the type metadata always set to List<uint8 [not-null]>)

From a coding point of view, I should think we would eventually want
explicit casts that do not presume a certain C++ inheritance
hierarchy, which might cause downstream code brittleness. Hard to
predict this precisely at this moment.
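
To make points 1 and 3 concrete, here is a minimal sketch of what an
explicit, hierarchy-independent cast might look like. Everything in it
(the Buffer alias, the simplified Array, UInt8Array, StringArray and
ListArray structs, and the ViewAsList helper) is a hypothetical stand-in
invented for illustration, not the actual Arrow C++ classes or API. It
assumes 32-bit offsets and that buffers are held via std::shared_ptr.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the real Arrow C++ classes.
using Buffer = std::vector<uint8_t>;

struct Array {
  virtual ~Array() = default;
  int32_t length = 0;
  std::shared_ptr<Buffer> null_bitmap;  // may be null when there are no nulls
};

// uint8 [not-null] child values: one data buffer, no null bitmap needed.
struct UInt8Array : Array {
  std::shared_ptr<Buffer> data;
};

// Nested list: offsets into a child array.
struct ListArray : Array {
  std::shared_ptr<std::vector<int32_t>> offsets;  // length + 1 entries
  std::shared_ptr<Array> values;
};

// First-class string: a sibling of ListArray, not a subclass of it.
struct StringArray : Array {
  std::shared_ptr<std::vector<int32_t>> offsets;  // length + 1 entries
  std::shared_ptr<Buffer> bytes;                  // UTF-8 encoded data

  // Copies slot i out as a std::string (ignores nulls for brevity).
  std::string Value(int32_t i) const {
    const int32_t start = (*offsets)[i];
    const int32_t end = (*offsets)[i + 1];
    return std::string(reinterpret_cast<const char*>(bytes->data()) + start,
                       static_cast<size_t>(end - start));
  }
};

// Explicit cast: builds new container metadata over the same buffers.
// No bytes or offsets are copied; only shared_ptr handles are duplicated.
std::shared_ptr<ListArray> ViewAsList(const StringArray& s) {
  auto values = std::make_shared<UInt8Array>();
  values->length = static_cast<int32_t>(s.bytes->size());
  values->data = s.bytes;

  auto list = std::make_shared<ListArray>();
  list->length = s.length;
  list->null_bitmap = s.null_bitmap;
  list->offsets = s.offsets;
  list->values = values;
  return list;
}
```

With this layout, std::dynamic_pointer_cast<ListArray> on a StringArray
yields a null pointer, so callers that want the nested view go through
the explicit helper instead of relying on the inheritance relationship.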

- Wes

On Wed, Jul 13, 2016 at 10:28 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
> Today String and Binary types are represented in memory as list<byte> [1]
>  and we use logical types to distinguish between a list of bytes and string
> type [2].
>
> The question of whether this is sufficient or if we should make a first
> class string/binary type has come up tangentially on a few threads and we
> should try to come to a conclusion if we want to add it as part of a
> spec.   I think the current proposal is that the String type would consist
> of null-bitmap buffer, an offset buffer and a buffer containing bytes (for
> strings the bytes would be UTF-8 encoded strings).  The main difference
> with the list representation is, individual bytes cannot be marked as null
> because there isn't a nested Array.
>
> To quote Jacques for the pros of this approach:
>
>  My main argument is that the most basic types most people need come in
> this order from my experience:
>
> Int
> String
> Float
> Decimal
> Binary
>
> Note that I'm not focused on width here, just generally "what people use".
> So I think a string comes second in priority and ease of
> use/approachability necessitate this as a first class concept. This is
> beyond the fact that it has specialized rules that are separate from a
> List<Byte>.
>
>
>
> The main argument for not doing this is it adds additional types that need
> to be implemented and can lead to some amount of redundant code.  For
> instance, in the current C++ implementation we are able to have a String
> Array that extends a List Type and re-use already defined equality methods
> [3].
>
> What do people think?
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
> [3]
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string.h#L68
