arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Micah Kornfield <>
Subject Re: Some questions/proposals for the spec (
Date Sat, 09 Apr 2016 03:21:56 GMT
I think one of Arrow's initial design goals should be simplicity of
implementation of the spec.   We can always make things more
complicated in the future.

This leads me to prefer a fixed size.   Wes (or anyone else) in
practice have you seen a union of structs with more then 127 members?

I would vote for int8_t for the types array for unions and letting
consumers of Arrow nest Unions at the application layer if they need
more slots.

On Fri, Apr 8, 2016 at 8:33 AM, Wes McKinney <> wrote:
> On Fri, Apr 8, 2016 at 8:07 AM, Jacques Nadeau <> wrote:
>>> > I believe this choice was primarily about simplifying the code (similar
>>> to why we have a n+1
>>> > offsets instead of just n in the list/varchar representations (even
>>> though n=0 is always 0)). In both
>>> > situations, you don't have to worry about writing special code (and a
>>> condition) for the boundary
>>> > condition inside tight loops (e.g. the last few bytes need to be handled
>>> differently since they
>>> > aren't word width).
>>> Sounds reasonable.  It might be worth illustrating this with a
>>> concrete example.  One scenario that this scheme seems useful for is a
>>> creating a new bitmap based on evaluating a predicate (i.e. all
>>> elements >X).  In this case would it make sense to make it a multiple
>>> of 16, so we can consistently use SIMD instructions for the logical
>>> "and" operation?
>> Hmm... interesting thought. I'd have to look but I also recall some of the
>> newer stuff supporting even wider widths. What do others think?
>>> I think the spec is slightly inconsistent.  It says there is 6 bytes
>>> of overhead per entry but then follows: "with the smallest byte width
>>> capable of representing the number of types in the union."  I'm
>>> perfectly happy to say it is always 1, always 2, or always capped at
>>> 2.  I agree 32K/64K+ types is a very unlikely scenario.  We just need
>>> to clear up the ambiguity.
>> Agreed. Do you want to propose an approach & patch to clarify?
> I can also take responsibility for the ambiguity here. My preference
> is to use int16_t for the types array (memory suitably aligned), but
> as 1 byte will be sufficient nearly all of the time, it's a slight
> trade-off in memory use vs. code complexity, e.g.
> if (children_.size() < 128) {
>   // types is only 1 byte
> } else {
>   // types is 2 bytes
> }
> Realistically there won't be that many affected code paths, so I'm
> comfortable with either choice (2-bytes always, or 1 or 2 bytes
> depending on the size of the union).

View raw message