arrow-dev mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Proposed new type: Fixed width list
Date Wed, 13 Jul 2016 05:42:04 GMT
Two questions come to mind.
1.  Is it useful to have fixed-width list types exclusive of
binary types?
2.  Should binary/string types have their own separate memory
layout/be a primitive type?

IMO, the answer to 1 is yes.  Another example of a use-case where
this is handy is the output of the aggregate functions
"histogram_numeric" and "percentile_approx" in Apache Hive [1].

For #2, I'm still not sure I see a clear benefit or harm either
way.  The benefit of giving them their own type is that, by definition,
you don't need to worry about ill-formed arrays (e.g. having a byte
declared null).  The potential cost is more code to deal with the
additional types (although we end up paying this cost a little bit
even if we treat everything as a list).

Jacques, can you elaborate more on where you see harm in the reduction?
If we can agree on the first question, it might pay to handle the
discussion of bytes/string as a primitive type on a separate thread.  (I
think it got lost previously because many issues surfaced in the same
e-mail and there was no time to do a Google Hangout.  I apologize for
that.)

Thanks,
Micah

[1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

On Tue, Jul 12, 2016 at 5:44 PM, Jacques Nadeau <jacques@apache.org> wrote:
> Completely in support of fixed bit width types. Just thinking that it
> shouldn't be done by using a list.
>
> Not sure how the two are orthogonal. What am I missing?
>
> On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>>
>> I think it would be good to revisit that discussion. This is somewhat
>> orthogonal -- i.e. having a fixed-width binary type that does not have
>> an accompanying list of n + 1 offsets.
>>
>> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <jacques@apache.org> wrote:
>> > I was further reflecting on the previous discussion on lists and
>> > binary/utf8. I think that treating strings (binary or utf8) as lists is
>> > too much of a reduction. This seems like a good example of how they are
>> > treated differently (beyond the previously discussed
>> > not-possible-nullability). As such I'm -1 on this change and would
>> > prefer that we go back and further review the concept of treating a
>> > string of bits or bytes as a "primitive" type.
>> >
>> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>> >
>> >> I'm +1 on this. I've seen fixed-width strings and other things in many
>> >> different contexts. I would say that fixed-width binary is probably
>> >> the primary use case, but you could imagine casting int96 data to
>> >> fixed_list<3, int32>.
>> >>
>> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield <emkornfield@gmail.com> wrote:
>> >> > This came up in a code review a while ago, but what do people think
>> >> > of adding a fixed-width list type to the memory layout spec?
>> >> >
>> >> > This would have the same layout as the current list type.  Instead
>> >> > of having a separate offset buffer to determine location and length of
>> >> > each list, the length would be stored as part of metadata and offsets
>> >> > would be calculated using multiplication instead of lookups.
>> >> >
>> >> > One use case for this is an easy mapping to the
>> >> > "FIXED_LEN_BYTE_ARRAY"
>> >> > in parquet.
>> >> >
>> >> > If people like the idea I can file a JIRA and update the current
>> >> > layout.md.
>> >> >
>> >> > Thanks,
>> >> > -Micah
>> >>
>
>
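
For illustration, the layout difference discussed in this thread can be
sketched as follows. This is only a minimal sketch, not Arrow code: the
function names and the plain Python lists standing in for buffers are
hypothetical, and null handling is left out. It compares a variable-width
list, which looks up each slot in a buffer of n + 1 offsets, with a
fixed-width list, which computes the slot from a width stored once in
metadata.

```python
# Hypothetical helpers for illustration only; these are not Arrow APIs.

def variable_list_slot(offsets, values, i):
    """Variable-width list: slot i spans offsets[i]..offsets[i + 1]."""
    return values[offsets[i]:offsets[i + 1]]


def fixed_list_slot(width, values, i):
    """Fixed-width list: no offset buffer, the start is i * width."""
    start = i * width
    return values[start:start + width]


# Three lists of two values each, stored in one contiguous child buffer.
values = [10, 11, 20, 21, 30, 31]
offsets = [0, 2, 4, 6]  # n + 1 offsets, needed only by the variable-width layout
width = 2               # stored once as metadata in the fixed-width layout

# Both layouts address the same child values for every slot.
for i in range(3):
    assert variable_list_slot(offsets, values, i) == fixed_list_slot(width, values, i)
```

With a fixed width, the offsets buffer is redundant because every entry
equals i * width, which is the saving the proposal describes.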
