arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Robinson <danrobinson...@gmail.com>
Subject "Empty" slots
Date Thu, 10 Mar 2016 07:31:53 GMT
Thanks, Wes. For hashing, my thought was that you could just hash the null
bitmask concatenated with the values array (this should be doable in
constant space for ordinary one-pass hash functions). If nulls are zeroed
out, this will guarantee that equal primitive arrays hash to the same
value. But if the empty slots in the value array are undefined and could
have any value, this method would give you false negatives, and you would
need to do something more complex and slower.

I was also curious about how this is treated in Drill ValueVectors. The
example illustration in the introductory explanation at
https://drill.apache.org/docs/value-vectors/ uses 0 for nulled values,
whatever that's worth.

This is a tangent, but I noticed Drill (and thus for the moment Java Arrow:
https://github.com/apache/arrow/blob/master/java/vector/src/main/codegen/templates/NullableValueVectors.java#L369)
uses
0 in the null bitmask to represent null values, while you use the NumPy
MaskedArray convention of having 1 represent null values. Is there an
advantage?

On Wednesday, March 9, 2016, Wes McKinney <wes@cloudera.com> wrote:

> hey Dan,
>
> You bring up a good point. It's unclear whether this would make
> hashing much less complex (if you ignore the null bits when computing
> the hash function, then it probably would, especially on repeated
> fields. We'd need to do some experiments to see if there are cases
> where skipping the null bits would yield bad hashes. I'm not sure)
>
> For the memory comparison question, I'd like to hear from the Drill
> team on their experience with ValueVectors and how they handle the
> "null" space in arrays. In the case of list / repeated values, you
> *do* need to initialize the null offset slot to the previous value so
> that offset computations always work. For example, the list<int32>
> array
>
> [[0, 1, 2], null, null, [4, 5, 6]]
>
> would need to have
>
> nulls [LSB right-to-left ordering]: 0 0 0 0 0 1 1 0
> offsets: 0 3 3 3 6
> values 0 1 2 4 5 6
>
> - Wes
>
> On Mon, Mar 7, 2016 at 4:08 PM, Daniel Robinson
> <danrobinson010@gmail.com> wrote:
> > Am I correct that under the current draft spec, the values in meaningless
> > slots (such as slots corresponding to a 1 in an associated null bitmask,
> or
> > unused slots in a sparse union) are undefined?
> >
> > If so, might it be worth considering requiring that they be zeroed out?
> > This would add some time overhead to array building (for sparse unions,
> > memory could be zeroed out when it is allocated; for nullable arrays, it
> > may be more efficient to do so when nulls are added), but would remove
> most
> > of the complexity from hash calculations and equality comparisons. (See
> > ARROW-32 and ARROW-38.)  For example, comparison of nullable primitive
> > arrays could be done with 2 calls to memcmp.  As a bonus, it would speed
> up
> > operations for which null and 0 are equivalent or near-equivalent (such
> as
> > sum, sort on unsigned integers, ).
> >
> > Other than the overhead, one potential downside is that it would make it
> > impossible to just add a null_bitmask over an existing array without
> > copying all of the memory. Perhaps this would best be done with a
> > separately defined masking vector, which would not require that all
> masked
> > values be set to 0 (and would thus require more complex hashing
> algorithms).
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message