arrow-dev mailing list archives

From Antoine Pitrou <anto...@python.org>
Subject Re: [DISCUSS] 64-bit offset variable width types (i.e. Large List, Large String, Large Bytes)
Date Thu, 11 Apr 2019 14:06:28 GMT

On 11/04/2019 at 10:52, Micah Kornfield wrote:
> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit offsets
> to Lists, Strings and binary data types.
> 
> Philipp started an implementation for the large list type [3] and I hacked
> together a potentially viable Java implementation [4].
> 
> I'd like to kick off the discussion for getting these types voted on.  I'm
> coupling them together because I think there are design considerations for
> how we evolve Schema.fbs.
> 
> There are two proposed options:
> 1.  The current PR proposal which adds a new type LargeList:
>   // List with 64-bit offsets
>   table LargeList {}
> 
> 2.  As François suggested, it might be cleaner to parameterize List with
> offset width.  I suppose something like:
> 
> table List {
>   // Only 32-bit and 64-bit widths are supported.
>   bitWidth: int = 32;
> }
> 
> I think Option 2 is cleaner and potentially better long-term, but I think
> it breaks forward compatibility of the existing arrow libraries.  If we
> proceed with Option 2, I would advocate making the change to Schema.fbs all
> at once for all types (assuming we think that 64-bit offsets are desirable
> for all types) along with future compatibility checks to avoid multiple
> releases where future compatibility is broken (by broken I mean the
> inability to detect that an implementation is receiving data it can't
> read).  What are people's thoughts on this?

I think Option 1 is OK.  Making List / String / Binary parameterizable
doesn't bring any *concrete* benefit, since the types will not be
physically interchangeable.  The cost of breaking compatibility should
be offset by a compelling benefit, which doesn't seem to exist here.
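
To illustrate the "not physically interchangeable" point, here is a
minimal sketch.  It assumes only the layout under discussion (an offsets
buffer with one more entry than there are lists); MakeOffsets and
OffsetsByteSize are hypothetical helpers, not Arrow APIs.  The same
logical lists yield offsets buffers of different element width, so a
reader expecting 32-bit offsets cannot simply reinterpret a 64-bit
buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not an Arrow API): build the offsets buffer for a
// sequence of list lengths, using the requested offset type.
template <typename OffsetType>
std::vector<OffsetType> MakeOffsets(const std::vector<std::size_t>& lengths) {
  std::vector<OffsetType> offsets;
  offsets.reserve(lengths.size() + 1);
  OffsetType current = 0;
  offsets.push_back(current);  // the first offset is always 0
  for (std::size_t len : lengths) {
    // Guard against overflowing the chosen offset width.
    assert(len <= static_cast<std::size_t>(
                      std::numeric_limits<OffsetType>::max() - current));
    current += static_cast<OffsetType>(len);
    offsets.push_back(current);
  }
  return offsets;
}

// Size in bytes of an offsets buffer; this depends on the offset width,
// which is why the two layouts differ physically.
template <typename OffsetType>
std::size_t OffsetsByteSize(const std::vector<OffsetType>& offsets) {
  return offsets.size() * sizeof(OffsetType);
}
```

For three lists, the offsets buffer has four entries either way, but it
occupies twice as many bytes with std::int64_t as with std::int32_t.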

Of course, implementations are free to refactor their internals to avoid
code duplication (for example the C++ ListBuilder and LargeListBuilder
classes could be instances of a BaseListBuilder<IndexType> generic type)...
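
A rough sketch of what such a refactoring could look like follows.  This
is a simplified, hypothetical class, not the actual Arrow C++ builders:
the child values are plain ints, and the member names (Append, length,
offsets, values) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified sketch; not the real Arrow C++ builder classes.
// The offset width is the only thing that differs between the two list types.
template <typename IndexType>
class BaseListBuilder {
 public:
  using offset_type = IndexType;

  BaseListBuilder() { offsets_.push_back(0); }  // offsets always start at 0

  // Append one list whose elements are the given child values.
  void Append(const std::vector<int>& child_values) {
    values_.insert(values_.end(), child_values.begin(), child_values.end());
    // The running total of child values must fit in the offset type.
    assert(values_.size() <=
           static_cast<std::size_t>(std::numeric_limits<IndexType>::max()));
    offsets_.push_back(static_cast<IndexType>(values_.size()));
  }

  // Number of lists appended so far.
  std::int64_t length() const {
    return static_cast<std::int64_t>(offsets_.size()) - 1;
  }

  const std::vector<IndexType>& offsets() const { return offsets_; }
  const std::vector<int>& values() const { return values_; }

 private:
  std::vector<IndexType> offsets_;
  std::vector<int> values_;
};

// The two concrete builders share one implementation.
using ListBuilder = BaseListBuilder<std::int32_t>;
using LargeListBuilder = BaseListBuilder<std::int64_t>;
```

With this shape, the schema can still expose List and LargeList as
distinct types (Option 1) while the implementation avoids duplicating
the builder logic.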

Regards

Antoine.
