arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <...@cloudera.com>
Subject Re: Format: storing null count + required/non-nullable types
Date Sat, 20 Feb 2016 20:41:14 GMT
Since the data is immutable, not having to ever recompute the null
count has a lot of benefits -- I find that sum(isnull(x)) is a
commonly examined statistic. I'm expecting we'll soon have
hardware-accelerated popcount available in the implementations, so
that even the null-aware code paths can cheaply skip swaths of all
null data.

On Sat, Feb 20, 2016 at 12:31 PM, Jacques Nadeau <jacques@apache.org> wrote:
> I think part of it comes from the headache of all the types, etc. You get
> into some sort of permutation fatigue :)
>
> bool v. integer:
>
> Interesting question. Haven't thought a lot about it but it seems like
> cardinality of nulls could be a useful metric to decide algorithms. Not
> sure that it is more expensive to maintain.
>
> On Sat, Feb 20, 2016 at 12:12 PM, Daniel Robinson <danrobinson010@gmail.com>
> wrote:
>
>> I yield to your judgment from experience. Although I am surprised it
>> wouldn't simplify the code in this case, since you will use the same
>> algorithms on (and, if you never allocate a bitmask as Wes suggested, use
>> the same data structures for) "nullable arrays" with null_count 0 as for
>> non-nullable arrays. And formally I think it makes sense to have a nullable
>> Array<T> have one of two types: some_nulls Array<T> or no_nulls Array<T>,
>> much like its values have either type T or null.
>>
>> Is there a reason to make null_count an integer? Or could it just be a bool
>> that keeps track of whether there are nulls or not?
>>
>>
>> On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacques@apache.org>
>> wrote:
>>
>> > Makes sense
>> >
>> > On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <wes@cloudera.com> wrote:
>> >
>> > > My expectation would be that data without nulls (as with required
>> > > types) would typically not have the null bitmap allocated at, but this
>> > > would be implementation dependent. For example, in builder classes,
>> > > the first time a null is appended, the null bitmap could be allocated.
>> > >
>> > > In an IPC / wire protocol context, there would be no reason to send
>> > > extra bits when the null count is 0 -- the data receiver, based on
>> > > their implementation, could decide whether or not to allocate a bitmap
>> > > based on that information. Since the data structures are intended as
>> > > immutable, there is no specific need (to create an all-0 bitmap).
>> > >
>> > > On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacques@apache.org>
>> > > wrote:
>> > > > We actually started there (and in fact Drill existed there for the
>> last
>> > > > three years). However, more and more, me and other members of that
>> team
>> > > > have come to the conclusion that the additional complexity isn't
>> worth
>> > > the
>> > > > extra level of code complication. By providing the null count we can
>> > > > achieve the same level of efficiency (+/- carrying around an extra
>> > bitmap
>> > > > which is pretty nominal in the grand scheme of things).
>> > > >
>> > > > Another thought could be exposing nullability as a physical property
>> > and
>> > > > not have be part of the logical model. That being said, I don't think
>> > it
>> > > is
>> > > > worth the headache.
>> > > >
>> > > > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <
>> > > danrobinson010@gmail.com>
>> > > > wrote:
>> > > >
>> > > >> Hi all,
>> > > >>
>> > > >> I like this proposal (as well as the rest of the spec so far!).
 But
>> > why
>> > > >> not go further and just store arrays that are nullable according
to
>> > the
>> > > >> schema but have no nulls in them as "non-nullable" data
>> > structures—i.e.
>> > > >> structures that have no null bitmask? (After all, it would obviously
>> > be
>> > > a
>> > > >> waste to allocate a null bitmask for arrays with null_count =
0.) So
>> > > there
>> > > >> will be two types on the data structure level, and two
>> implementations
>> > > of
>> > > >> every algorithm, one for each of those types.
>> > > >>
>> > > >> If you do that, I'm not sure I see a reason for keeping track
of
>> > > >> null_count. Is there ever an efficiency gain from having that
stored
>> > > with
>> > > >> an array? Algorithms that might introduce or remove nulls could
just
>> > > keep
>> > > >> track of their own "null_count" that increments up from 0, and
>> create
>> > a
>> > > >> no-nulls data structure if they never find one.
>> > > >>
>> > > >> I think this might also simplify the system interchange validation
>> > > problem,
>> > > >> since a system could just check the data-structure-level type
of the
>> > > input.
>> > > >> (Although I'm not sure I understand why that would be necessary
at
>> > > >> "runtime.")
>> > > >>
>> > > >> Perhaps you should have different names for the data-structure-level
>> > > types
>> > > >> to distinguish them from the "nullable" and "non-nullable" types
at
>> > the
>> > > >> schema level. (And also for philosophical reasons—since the
arrays
>> are
>> > > >> immutable, "nullable" doesn't really have meaning on that level,
>> does
>> > > it?).
>> > > >> "some_null" and "no_null"?  Maybe "sparse" and "dense," although
>> that
>> > > too
>> > > >> has a different meaning elsewhere in the spec...
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <wes@cloudera.com>
>> > > wrote:
>> > > >>
>> > > >> > hi folks,
>> > > >> >
>> > > >> > welcome to all! It's great to see so many people excited
about our
>> > > >> > plans to make data systems faster and more interoperable.
>> > > >> >
>> > > >> > In thinking about building some initial Arrow integrations,
I've
>> run
>> > > >> > into a couple of inter-related format questions.
>> > > >> >
>> > > >> > The first is a proposal to add a null count to Arrow arrays.
With
>> > > >> > optional/nullable data, null_count == 0 will allow algorithms
to
>> > skip
>> > > >> > the null-handling code paths and treat the data as
>> > > >> > required/non-nullable, yielding performance benefits. For
example:
>> > > >> >
>> > > >> > if (arr->null_count() == 0) {
>> > > >> >   ...
>> > > >> > } else {
>> > > >> >   ...
>> > > >> > }
>> > > >> >
>> > > >> > Relatedly, at the data structure level, there is little semantic
>> > > >> > distinction between these two cases
>> > > >> >
>> > > >> > - Required / Non-nullable arrays
>> > > >> > - Optional arrays with null count 0
>> > > >> >
>> > > >> > My thoughts are that "required-ness" would best be minded
at the
>> > > >> > metadata / schema level, rather than tasking the lowest tier
of
>> data
>> > > >> > structures and algorithms with handling the two semantically
>> > distinct,
>> > > >> > but functionally equivalent forms of data without nulls.
When
>> > > >> > performing analytics, it adds complexity as some operations
may
>> > > >> > introduce or remove nulls, which would require type metadata
to be
>> > > >> > massaged i.e.:
>> > > >> >
>> > > >> > function(required input) -> optional output, versus
>> > > >> >
>> > > >> > function(input [null_count == 0]) -> output [maybe null_count
>
>> 0].
>> > > >> >
>> > > >> > In the latter case, algorithms set bits and track the number
of
>> > nulls
>> > > >> > while constructing the output Arrow array; the former adds
some
>> > extra
>> > > >> > complexity.
>> > > >> >
>> > > >> > The question of course, is where to enforce "required" in
data
>> > > >> > interchange. If two systems have agreed (through exchange
of
>> > > >> > schemas/metadata) that a particular batch of Arrow data is
>> > > >> > non-nullable, I would suggest that the null_count == 0 contract
be
>> > > >> > validated at that point.
>> > > >> >
>> > > >> > Curious to hear others' thoughts on this, and please let
me know
>> if
>> > I
>> > > >> > can clarify anything I've said here.
>> > > >> >
>> > > >> > best,
>> > > >> > Wes
>> > > >> >
>> > > >>
>> > >
>> >
>>

Mime
View raw message