arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <>
Subject Re: Some questions/proposals for the spec (
Date Fri, 08 Apr 2016 15:33:14 GMT
On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <> wrote:
>> > I believe this choice was primarily about simplifying the code (similar
>> to why we have a n+1
>> > offsets instead of just n in the list/varchar representations (even
>> though n=0 is always 0)). In both
>> > situations, you don't have to worry about writing special code (and a
>> condition) for the boundary
>> > condition inside tight loops (e.g. the last few bytes need to be handled
>> differently since they
>> > aren't word width).
>> Sounds reasonable.  It might be worth illustrating this with a
>> concrete example.  One scenario that this scheme seems useful for is a
>> creating a new bitmap based on evaluating a predicate (i.e. all
>> elements >X).  In this case would it make sense to make it a multiple
>> of 16, so we can consistently use SIMD instructions for the logical
>> "and" operation?
> Hmm... interesting thought. I'd have to look but I also recall some of the
> newer stuff supporting even wider widths. What do others think?
>> I think the spec is slightly inconsistent.  It says there is 6 bytes
>> of overhead per entry but then follows: "with the smallest byte width
>> capable of representing the number of types in the union."  I'm
>> perfectly happy to say it is always 1, always 2, or always capped at
>> 2.  I agree 32K/64K+ types is a very unlikely scenario.  We just need
>> to clear up the ambiguity.
> Agreed. Do you want to propose an approach & patch to clarify?

I can also take responsibility for the ambiguity here. My preference
is to use int16_t for the types array (memory suitably aligned), but
as 1 byte will be sufficient nearly all of the time, it's a slight
trade-off in memory use vs. code complexity, e.g.

if (children_.size() < 128) {
  // types is only 1 byte
} else {
  // types is 2 bytes

Realistically there won't be that many affected code paths, so I'm
comfortable with either choice (2-bytes always, or 1 or 2 bytes
depending on the size of the union).

View raw message