arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Discussion: Should we make string/binary types first class Arrow Array types?
Date Wed, 27 Jul 2016 05:09:07 GMT
Wes, Jacques, others...

Any thoughts on this?   Let me know if you would like to clarify something,
I think I was a little long winded.  It would be good to come to a
consensus one way or another.

Thanks,
Micah

On Sun, Jul 17, 2016 at 1:43 PM, Micah Kornfield <emkornfield@gmail.com>
wrote:

> Hi Wes and Jacques,
>
> Thanks for the thorough analysis.  I agree that Strings should be easy to
> work with.  I'm just trying to understand how making a distinct string type
> defined in the memory layout spec [1] brings a lot of additional utility.
>
> I think of there being two distinct concerns with Arrow:
>
> 1.  Layout - What metadata and data elements are required to represent a
> specific type in a flat address space.
>
> 2.  Manipulation - How we build interfaces for working with the memory
> layout.
>
> With respect to Memory Layout, introducing a new string type seems to add
> redundancy.  As Wes noted, List<uint8 [not-null]> is sufficient to
> represent the layout for strings.  So the main benefits for introducing a
> new memory layout for a string type is an optimization.  By introducing the
> new type we avoid invalid string construction (having uint_t elements
> marked as null in the nested array) and to save a few bytes/extra function
> call when "(de)serializing" a string column.
>
> With respect to manipulation, I agree, that having the right API/modeling
> to treat strings as first class objects makes a lot of sense.   But I don't
> think that the specification needs to explicitly make allowances for it.
> Once you have constructed a Java/C++ wrapper around the memory layout you
> can choose to expose the right convenience APIs through OO abstraction.
> The construction of the correct object wrapper is governed by Metadata
> defined in [2] and an understanding of how the logical type maps to the
> appropriate memory layout.  At the moment metadata doesn't specify any sort
> of class hierarchy which I believe is the correct thing to do from a
> specification perspective.
>
> The C++ implementation currently has StringArrays inheriting from a
> ListArrays which was an implementation convenience and something we should
> revisit (I agree with Wes's point on not relying on  C++'s type system for
> casting).
> The primary argument for changing the existing implementation seems to be
> that strings should be considered "non-nested" types.  Whether strings are
> nested or not seems to fall squarely into the manipulation concern (except
> for the optimizations mentioned above) and therefore, IMO, an
> implementation detail.     When thinking about how this plays out in code
> I imagine a visitor pattern.  I've provided some pseudo-code below for two
> possible visitor classes make StringArrays first class objects but wouldn't
> require updates to the specification.
>
> I've tried to think where testing a particular object for "nested"-ness
> makes sense by itself and couldn't come up with something off the top of my
> head.  It seems once you determine an Array is non-nested you still want to
> test for exact primitive type you are dealing with.
>
> Given these points I'm still ambivalent about adding a new string/binary
> type to the spec. It would be an improvement but it seems like a somewhat
> minor improvement.  If people can provide stronger use-cases for adding the
> new type I'd be less ambivalent, but at the moment this seems like more of
> an implementation concern.
>
> Thanks,
> Micah
>
> // Visitor patterns for arrays, that do not require any updates to the
> memory layout.
> class ClassVisitor {
>     void visit(Int32Array );
>     void visit(UInt32Array );
>     void visit(DoubleArray );
>     void visit(ListArray );
>     void visit(StringArray ); // if we changed the hierarchy, this would
> be sufficient to treat strings as a first class type
>     // Other types elided
> }
>
> or
>
> class BufferVisitor { // type disambiguation happens by calling the
> correctly
>                                 // overloaded method
>     void visit_numeric(TypeMedata, null_bitmap, value_buffer);
>     void visit_list(TypeMedata, null_bitmap, offset_buffer, Array
> nested_type);
>     void visit_string(TypeMetadata, null_bitmap, offset_buffer,
> byte_buffer); // sufficient for treating string types as non-nested.
>     // Other types elided.
> }
>
> [1] https://github.com/apache/arrow/blob/master/format/Layout.md
> [2] https://github.com/apache/arrow/blob/master/format/Message.fbs
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message