arrow-dev mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Proposed new type: Fixed width list
Date Wed, 13 Jul 2016 18:16:12 GMT
Thanks Jacques.

I'm ok dropping the fixed width proposal for now and revisiting it at
a later point.  I'll start a thread later today to break off the
discussion on adding string/binary as a primitive type.

-Micah

On Wed, Jul 13, 2016 at 7:49 AM, Jacques Nadeau <jacques@apache.org> wrote:
>
> On Tue, Jul 12, 2016 at 10:42 PM, Micah Kornfield <emkornfield@gmail.com>
> wrote:
>>
>> Two questions come to mind.
>> 1.  Is it useful to have fixed width with list types exclusive of
>> binary types?
>
>
> I think "useful" isn't a strong enough reason to add more types. It seems
> like a fairly rare occurrence and thus a premature optimization. (I could be
> convinced otherwise with more evidence). I propose we avoid adding types
> unless there are present use cases that people need to solve something. For
> example, if the Hive guys are in the process of adopting Arrow and this
> becomes a big memory/cpu issue for them. (I think the other memory/cpu
> benefits of Arrow would make this highly unlikely for at least a year or
> two.)
>
> There are a number of specializations that will come in time but I worry
> that if we grow the types too wide (especially initially), everyone is only
> going to support a subset of types and then we're going to have the same
> challenges of incompatibility. Once we have two or three users who all are
> working against variable width types and complaining about the overhead, it
> seems like we are sure to build the right thing and avoid bit rot (something
> that we (I) learned the hard way by adding all the types under the sun early
> in the Drill ValueVectors construction).
>
>>
>> 2.  Should binary/string types have their own separate memory
>> layout/be a primitive type?
>
>
> I'm happy to cover this on a separate thread. My main argument is that the
> most basic types most people need come in this order from my experience:
>
> Int
> String
> Float
> Decimal
> Binary
>
> Note that I'm not focused on width here, just generally "what people use".
> So I think a string comes second in priority and ease of use/approachability
> necessitate this as a first class concept. This is beyond the fact that it
> has specialized rules that are separate from a List<Byte>.
>
>
>>
>>
>> IMO, I think the answer to 1 is yes.  Another example of a
>> use-case where this is handy is for the outputs of the aggregate
>> functions "histogram_numeric" and "percentile_approx" in Apache Hive
>> [1].
>>
>> For #2, I'm still not sure I see a clear benefit or harm either
>> way.  The benefit of having their own type is that, by definition, you
>> don't need to worry about ill formed arrays (e.g. having a byte
>> declared null).  The potential cost is more code to deal with the
>> additional types (although we end up paying this cost a little bit
>> even if we treat everything as a list).
>>
>> Jacques can you elaborate more on where you see harm in the reduction?
>>  If we can agree on the first question, it might pay to handle the
>> discussion of bytes/string as a primitive type on a separate thread (I
>> think it got lost previously due to many issues surfaced in the same
>> e-mail and a lack of time to do a google hangout.  I apologize for
>> that).
>>
>> Thanks,
>> Micah
>>
>> [1] https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
>>
>> On Tue, Jul 12, 2016 at 5:44 PM, Jacques Nadeau <jacques@apache.org>
>> wrote:
>> > Completely in support of fixed bit width types. Just thinking that it
>> > shouldn't be done by using a list.
>> >
>> > Not sure how the two are orthogonal. What am I missing?
>> >
>> > On Tue, Jul 12, 2016 at 5:38 PM, Wes McKinney <wesmckinn@gmail.com>
>> > wrote:
>> >>
>> >> I think it would be good to revisit that discussion. This is somewhat
>> >> orthogonal -- i.e. having a fixed-width binary type that does not have
>> >> an accompanying list of n + 1 offsets.
>> >>
>> >> On Tue, Jul 12, 2016 at 5:36 PM, Jacques Nadeau <jacques@apache.org>
>> >> wrote:
>> >> > I was further reflecting on the previous discussion on lists and
>> >> > binary/utf8. I think that treating strings (binary or utf8) as lists
>> >> > is too much of a reduction. This seems like a good example of how
>> >> > they are treated differently (beyond the previously discussed
>> >> > not-possible-nullability). As such I'm -1 on this change and would
>> >> > prefer if we go back and further review the concept of treating a
>> >> > string of bits or bytes as a "primitive" type.
>> >> >
>> >> > On Tue, Jul 12, 2016 at 5:19 PM, Wes McKinney <wesmckinn@gmail.com>
>> >> > wrote:
>> >> >
>> >> >> I'm +1 on this. I've seen fixed-width strings and other things in
>> >> >> many different contexts. I would say that fixed-width binary is
>> >> >> probably the primary use case, but you could imagine casting int96
>> >> >> data to fixed_list<3, int32>
>> >> >>
>> >> >> On Mon, Jul 11, 2016 at 11:24 PM, Micah Kornfield
>> >> >> <emkornfield@gmail.com>
>> >> >> wrote:
>> >> >> > This came up in a code review a while ago, but what do people
>> >> >> > think of adding a fixed width list type to the memory layout
>> >> >> > spec?
>> >> >> >
>> >> >> > This would have the same layout as the current list type.
>> >> >> > Instead of having a separate offset buffer to determine location
>> >> >> > and length of each list, the length would be stored as part of
>> >> >> > metadata and offsets would be calculated using multiplication
>> >> >> > instead of lookups.
>> >> >> >
>> >> >> > One use case for this is an easy mapping to the
>> >> >> > "FIXED_LEN_BYTE_ARRAY" in parquet.
>> >> >> >
>> >> >> > If people like the idea I can file a JIRA and update the current
>> >> >> > layout.md.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > -Micah
>> >> >>
>> >
>> >
>
>
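
As a rough illustration of the layout proposed in the original message of
this thread (offsets computed by multiplication rather than looked up in an
offsets buffer), here is a minimal Python sketch. The function names, the
sample buffers, and the list size of 3 are assumptions made for
illustration only; they are not taken from the Arrow layout spec or from
any Arrow implementation.

```python
from typing import List, Sequence


def variable_list_slot(offsets: Sequence[int], values: Sequence[int], i: int) -> List[int]:
    """Variable-width list: slot i is found by reading two entries from an
    offsets buffer that holds n + 1 entries for n lists."""
    start, end = offsets[i], offsets[i + 1]
    return list(values[start:end])


def fixed_list_slot(list_size: int, values: Sequence[int], i: int) -> List[int]:
    """Fixed-width list: list_size is stored once (as metadata), so the
    start of slot i is computed as i * list_size with no offsets buffer."""
    start = i * list_size
    return list(values[start:start + list_size])


# Hypothetical child values buffer holding two lists of three elements each.
values = [1, 2, 3, 4, 5, 6]

# Variable-width layout needs n + 1 = 3 offsets for n = 2 lists.
offsets = [0, 3, 6]

# Fixed-width layout only needs the per-list length.
LIST_SIZE = 3

second_via_offsets = variable_list_slot(offsets, values, 1)
second_via_multiply = fixed_list_slot(LIST_SIZE, values, 1)
```

Both calls address the same slice of the values buffer; the fixed-width
variant simply trades the offsets buffer for one integer of metadata.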
