arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julien Le Dem <jul...@dremio.com>
Subject Re: Discussion: Should we make string/binary types first class Arrow Array types?
Date Wed, 10 Aug 2016 23:37:55 GMT
Yes

On Wed, Aug 10, 2016 at 11:24 AM, Wes McKinney <wesmckinn@gmail.com> wrote:

> I see the primary point of discussion on this to be whether String/Binary
> have the same layout on the wire as List<uint8-not null> (i.e. one
> Field/Node in the type tree versus two). I think what we are working
> towards is a single field rather than a List field and an Int field (bit
> width 8).
>
> On Aug 10, 2016, at 10:46 AM, Julien Le Dem <julien@dremio.com> wrote:
>
> Hi,
> Agreed.
> To paraphrase/complement what has been said:
> The types in format/Message.fbs [1] are "Logical types" or "user facing
> types", close to SQL types (they include String, Timestamp, Decimal, ...)
> and are related to Parquet's logical types [2][3].
> For each of those types there's a corresponding physical layout that is
> formally specified (example discussed here: String => List<UInt8-not null>).
> I'm going to open a couple of JIRA's to finalise the types and clarify the
> layout.
>
> [1] https://github.com/apache/arrow/blob/34e7f48cb71428c4d78cf00d8fdf00
> 45532d6607/format/Message.fbs#L63
> [2] https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
> [3] https://github.com/apache/parquet-format/blob/
> 66a5a7b982e291e06afb1da7ffe9da211318caba/src/main/thrift/
> parquet.thrift#L48
>
> Julien
>
> On Tue, Aug 9, 2016 at 4:20 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>
>> hi Micah
>>
>> I'm sorry for dropping the ball on this discussion. copying Julien as
>> he's been looking at the metadata recently.
>>
>> My thinking is that we should indicate in the format document that the
>> String and Binary logical types, as a matter of cross-implementation
>> convention, will have List<UInt8-not null> memory layout.
>>
>> In the C++ library at least, we can collapse the class structure to
>> make BinaryArray and StringArray not a subclass of ListArray,
>> factoring out common code that can be reused into helper inline
>> functions.
>>
>> Class hierarchy aside the main impact is adding entries to the Type
>> union in the Flatbuffers metadata
>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L63
>>
>> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> would be a single array unit in the buffer stream and flattened Field
>> metadata rather than nested types (2 array units as they are
>> presently).
>>
>> Separately, I am very interested in discussing a form of logical
>> Binary/StringArray in the C++ implementation that is internally
>> dictionary encoded. I'm proposing this as a possible new UTF-8
>> representation for pandas in the future:
>> https://wesm.github.io/pandas2-design/strings.html#possible-
>> solution-new-non-numpy-string-memory-layout
>>
>> Hopefully this isn't too incoherent, but it would be good to arrive at
>> some conclusion in this discussion if we need to implement the
>> changes.
>>
>> Thanks
>> Wes
>>
>> On Tue, Jul 26, 2016 at 10:09 PM, Micah Kornfield <emkornfield@gmail.com>
>> wrote:
>> > Wes, Jacques, others...
>> >
>> > Any thoughts on this?   Let me know if you would like to clarify
>> something,
>> > I think I was a little long winded.  It would be good to come to a
>> > consensus one way or another.
>> >
>> > Thanks,
>> > Micah
>> >
>> > On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfield@gmail.com
>> >
>> > wrote:
>> >
>> >> Hi Wes and Jacques,
>> >>
>> >> Thanks for the thorough analysis.  I agree that Strings should be easy
>> to
>> >> work with.  I'm just trying to understand how making a distinct string
>> type
>> >> defined in the memory layout spec [1] brings a lot of additional
>> utility.
>> >>
>> >> I think of there being two distinct concerns with Arrow:
>> >>
>> >> 1.  Layout - What metadata and data elements are required to represent
>> a
>> >> specific type in a flat address space.
>> >>
>> >> 2.  Manipulation - How we build interfaces for working with the memory
>> >> layout.
>> >>
>> >> With respect to Memory Layout, introducing a new string type seems to
>> add
>> >> redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
>> >> represent the layout for strings.  So the main benefits for
>> introducing a
>> >> new memory layout for a string type is an optimization.  By
>> introducing the
>> >> new type we avoid invalid string construction (having uint_t elements
>> >> marked as null in the nested array) and to save a few bytes/extra
>> function
>> >> call when "(de)serializing" a string column.
>> >>
>> >> With respect to manipulation, I agree, that having the right
>> API/modeling
>> >> to treat strings as first class objects makes a lot of sense.   But I
>> don't
>> >> think that the specification needs to explicitly make allowances for
>> it.
>> >> Once you have constructed a Java/C++ wrapper around the memory layout
>> you
>> >> can choose to expose the right convenience APIs through OO abstraction.
>> >> The construction of the correct object wrapper is governed by Metadata
>> >> defined in [2] and an understanding of how the logical type maps to the
>> >> appropriate memory layout.  At the moment metadata doesn't specify any
>> sort
>> >> of class hierarchy which I believe is the correct thing to do from a
>> >> specification perspective.
>> >>
>> >> The C++ implementation currently has StringArrays inheriting from a
>> >> ListArrays which was an implementation convenience and something we
>> should
>> >> revisit (I agree with Wes's point on not relying on  C++'s type system
>> for
>> >> casting).
>> >> The primary argument for changing the existing implementation seems to
>> be
>> >> that strings should be considered "non-nested" types.  Whether strings
>> are
>> >> nested or not seems to fall squarely into the manipulation concern
>> (except
>> >> for the optimizations mentioned above) and therefore, IMO, an
>> >> implementation detail.     When thinking about how this plays out in
>> code
>> >> I imagine a visitor pattern.  I've provided some pseudo-code below for
>> two
>> >> possible visitor classes make StringArrays first class objects but
>> wouldn't
>> >> require updates to the specification.
>> >>
>> >> I've tried to think where testing a particular object for "nested"-ness
>> >> makes sense by itself and couldn't come up with something off the top
>> of my
>> >> head.  It seems once you determine an Array is non-nested you still
>> want to
>> >> test for exact primitive type you are dealing with.
>> >>
>> >> Given these points I'm still ambivalent about adding a new
>> string/binary
>> >> type to the spec. It would be an improvement but it seems like a
>> somewhat
>> >> minor improvement.  If people can provide stronger use-cases for
>> adding the
>> >> new type I'd be less ambivalent, but at the moment this seems like
>> more of
>> >> an implementation concern.
>> >>
>> >> Thanks,
>> >> Micah
>> >>
>> >> // Visitor patterns for arrays, that do not require any updates to the
>> >> memory layout.
>> >> class ClassVisitor {
>> >>     void visit(Int32Array );
>> >>     void visit(UInt32Array );
>> >>     void visit(DoubleArray );
>> >>     void visit(ListArray );
>> >>     void visit(StringArray ); // if we changed the hierarchy, this
>> would
>> >> be sufficient to treat strings as a first class type
>> >>     // Other types elided
>> >> }
>> >>
>> >> or
>> >>
>> >> class BufferVisitor { // type disambiguation happens by calling the
>> >> correctly
>> >>                                 // overloaded method
>> >>     void visit_numeric(TypeMedata, null_bitmap, value_buffer);
>> >>     void visit_list(TypeMedata, null_bitmap, offset_buffer, Array
>> >> nested_type);
>> >>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
>> >> byte_buffer); // sufficient for treating string types as non-nested.
>> >>     // Other types elided.
>> >> }
>> >>
>> >> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
>> >> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>> >>
>> >>
>>
>
>
>
> --
> Julien
>
>


-- 
Julien

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message