arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Antoine Pitrou <anto...@python.org>
Subject Re: [DISCUSS] 64-bit offset variable width types (i.e.Large List, Last String, Large bytes)
Date Mon, 15 Apr 2019 17:40:15 GMT

Hi,

I am not Jacques, but I will try to give my own point of view on this.

The distinction between logical and physical types can be modelled in
two different ways:

1) a physical type can denote several logical types, but a logical type
can only have a single physical representation.  This is currently the
Arrow model.

2) a physical type can denote several logical types, and a logical type
can also be denoted by several physical types.  This is the Parquet model.

(theoretically, there are two other possible models, but they are not
very interesting to consider, since they don't seem to cater to concrete
use cases)

Model 1 is obviously more restrictive, while model 2 is more flexible.
Model 2 could be said "higher level"; you see something similar if you
compare Python's and C++'s typing systems.  On the other hand, model 1
provides a potentially simpler programming model for implementors of
low-level kernels, as you can simply query the logical type of your data
and you automatically know its physical type.

The model chosen for Arrow is ingrained in its API.  If we want to
change the model we'd better do it wholesale (implying probably a large
refactoring and a significant number of unavoidable regressions) to
avoid subjecting users to a confusing middle point.

Also and as a sidenote, "convertibility" between different types can be
a hairy subject... Having strict boundaries between types avoids being
dragged into it too early.


To return to the original subject: IMHO, LargeList (resp. LargeBinary)
should be a distinct logical type from List (resp. Binary), the same way
Int64 is a distinct logical type from Int32.

Regards

Antoine.



Le 15/04/2019 à 18:45, Francois Saint-Jacques a écrit :
> Hello,
> 
> I would like understand where do we stand on logical types and physical
> types. As I understand, this proposal is for the physical representation.
> 
> In the context of an execution engine, the concept of logical types becomes
> more important as two physical representation might have the same semantical
> values, e.g. LargeList and List where all values fits in the 32-bits.  A
> more
> complex example would be an Integer array and a dictionary array where
> values
> are integers.
> 
> Is this something only something only relevant for execution engine? What
> about
> the (C++) Array.Equals method and related comparisons methods? This also
> touch
> the subject of type equality, e.g. dict with different but compatible
> encoding.
> 
> Jacques, knowing that you worked on Parquet (which follows this model) and
> Dremio,
> what is your opinion?
> 
> François
> 
> Some related tickets:
> - https://jira.apache.org/jira/browse/ARROW-554
> - https://jira.apache.org/jira/browse/ARROW-1741
> - https://jira.apache.org/jira/browse/ARROW-3144
> - https://jira.apache.org/jira/browse/ARROW-4097
> - https://jira.apache.org/jira/browse/ARROW-5052
> 
> 
> 
> On Thu, Apr 11, 2019 at 4:52 AM Micah Kornfield <emkornfield@gmail.com>
> wrote:
> 
>> ARROW-4810 [1] and ARROW-750 [2] discuss adding types with 64-bit offsets
>> to Lists, Strings and binary data types.
>>
>> Philipp started an implementation for the large list type [3] and I hacked
>> together a potentially viable java implementation [4]
>>
>> I'd like to kickoff the discussion for getting these types voted on.  I'm
>> coupling them together because I think there are design consideration for
>> how we evolve Schema.fbs
>>
>> There are two proposed options:
>> 1.  The current PR proposal which adds a new type LargeList:
>>   // List with 64-bit offsets
>>   table LargeList {}
>>
>> 2.  As François suggested, it might cleaner to parameterize List with
>> offset width.  I suppose something like:
>>
>> table List {
>>   // only 32 bit and 64 bit is supported.
>>   bitWidth: int = 32;
>> }
>>
>> I think Option 2 is cleaner and potentially better long-term, but I think
>> it breaks forward compatibility of the existing arrow libraries.  If we
>> proceed with Option 2, I would advocate making the change to Schema.fbs all
>> at once for all types (assuming we think that 64-bit offsets are desirable
>> for all types) along with future compatibility checks to avoid multiple
>> releases were future compatibility is broken (by broken I mean the
>> inability to detect that an implementation is receiving data it can't
>> read).    What are peoples thoughts on this?
>>
>> Also, any other concern with adding these types?
>>
>> Thanks,
>> Micah
>>
>> [1] https://issues.apache.org/jira/browse/ARROW-4810
>> [2] https://issues.apache.org/jira/browse/ARROW-750
>> [3] https://github.com/apache/arrow/pull/3848
>> [4]
>>
>> https://github.com/apache/arrow/commit/03956cac2202139e43404d7a994508080dc2cdd1
>>
> 

Mime
View raw message