arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacques Nadeau <jacq...@apache.org>
Subject Re: Format: storing null count + required/non-nullable types
Date Sat, 20 Feb 2016 20:31:24 GMT
I think part of it comes from the headache of all the types, etc. You get
into some sort of permutation fatigue :)

bool v. integer:

Interesting question. Haven't thought a lot about it but it seems like
cardinality of nulls could be a useful metric to decide algorithms. Not
sure that it is more expensive to maintain.

On Sat, Feb 20, 2016 at 12:12 PM, Daniel Robinson <danrobinson010@gmail.com>
wrote:

> I yield to your judgment from experience. Although I am surprised it
> wouldn't simplify the code in this case, since you will use the same
> algorithms on (and, if you never allocate a bitmask as Wes suggested, use
> the same data structures for) "nullable arrays" with null_count 0 as for
> non-nullable arrays. And formally I think it makes sense to have a nullable
> Array<T> have one of two types: some_nulls Array<T> or no_nulls Array<T>,
> much like its values have either type T or null.
>
> Is there a reason to make null_count an integer? Or could it just be a bool
> that keeps track of whether there are nulls or not?
>
>
> On Sat, Feb 20, 2016 at 3:03 PM, Jacques Nadeau <jacques@apache.org>
> wrote:
>
> > Makes sense
> >
> > On Sat, Feb 20, 2016 at 11:56 AM, Wes McKinney <wes@cloudera.com> wrote:
> >
> > > My expectation would be that data without nulls (as with required
> > > types) would typically not have the null bitmap allocated at, but this
> > > would be implementation dependent. For example, in builder classes,
> > > the first time a null is appended, the null bitmap could be allocated.
> > >
> > > In an IPC / wire protocol context, there would be no reason to send
> > > extra bits when the null count is 0 -- the data receiver, based on
> > > their implementation, could decide whether or not to allocate a bitmap
> > > based on that information. Since the data structures are intended as
> > > immutable, there is no specific need (to create an all-0 bitmap).
> > >
> > > On Sat, Feb 20, 2016 at 11:52 AM, Jacques Nadeau <jacques@apache.org>
> > > wrote:
> > > > We actually started there (and in fact Drill existed there for the
> last
> > > > three years). However, more and more, me and other members of that
> team
> > > > have come to the conclusion that the additional complexity isn't
> worth
> > > the
> > > > extra level of code complication. By providing the null count we can
> > > > achieve the same level of efficiency (+/- carrying around an extra
> > bitmap
> > > > which is pretty nominal in the grand scheme of things).
> > > >
> > > > Another thought could be exposing nullability as a physical property
> > and
> > > > not have be part of the logical model. That being said, I don't think
> > it
> > > is
> > > > worth the headache.
> > > >
> > > > On Sat, Feb 20, 2016 at 11:43 AM, Daniel Robinson <
> > > danrobinson010@gmail.com>
> > > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> I like this proposal (as well as the rest of the spec so far!).  But
> > why
> > > >> not go further and just store arrays that are nullable according to
> > the
> > > >> schema but have no nulls in them as "non-nullable" data
> > structures—i.e.
> > > >> structures that have no null bitmask? (After all, it would obviously
> > be
> > > a
> > > >> waste to allocate a null bitmask for arrays with null_count = 0.)
So
> > > there
> > > >> will be two types on the data structure level, and two
> implementations
> > > of
> > > >> every algorithm, one for each of those types.
> > > >>
> > > >> If you do that, I'm not sure I see a reason for keeping track of
> > > >> null_count. Is there ever an efficiency gain from having that stored
> > > with
> > > >> an array? Algorithms that might introduce or remove nulls could just
> > > keep
> > > >> track of their own "null_count" that increments up from 0, and
> create
> > a
> > > >> no-nulls data structure if they never find one.
> > > >>
> > > >> I think this might also simplify the system interchange validation
> > > problem,
> > > >> since a system could just check the data-structure-level type of the
> > > input.
> > > >> (Although I'm not sure I understand why that would be necessary at
> > > >> "runtime.")
> > > >>
> > > >> Perhaps you should have different names for the data-structure-level
> > > types
> > > >> to distinguish them from the "nullable" and "non-nullable" types at
> > the
> > > >> schema level. (And also for philosophical reasons—since the arrays
> are
> > > >> immutable, "nullable" doesn't really have meaning on that level,
> does
> > > it?).
> > > >> "some_null" and "no_null"?  Maybe "sparse" and "dense," although
> that
> > > too
> > > >> has a different meaning elsewhere in the spec...
> > > >>
> > > >>
> > > >>
> > > >> On Sat, Feb 20, 2016 at 12:39 PM, Wes McKinney <wes@cloudera.com>
> > > wrote:
> > > >>
> > > >> > hi folks,
> > > >> >
> > > >> > welcome to all! It's great to see so many people excited about
our
> > > >> > plans to make data systems faster and more interoperable.
> > > >> >
> > > >> > In thinking about building some initial Arrow integrations, I've
> run
> > > >> > into a couple of inter-related format questions.
> > > >> >
> > > >> > The first is a proposal to add a null count to Arrow arrays.
With
> > > >> > optional/nullable data, null_count == 0 will allow algorithms
to
> > skip
> > > >> > the null-handling code paths and treat the data as
> > > >> > required/non-nullable, yielding performance benefits. For example:
> > > >> >
> > > >> > if (arr->null_count() == 0) {
> > > >> >   ...
> > > >> > } else {
> > > >> >   ...
> > > >> > }
> > > >> >
> > > >> > Relatedly, at the data structure level, there is little semantic
> > > >> > distinction between these two cases
> > > >> >
> > > >> > - Required / Non-nullable arrays
> > > >> > - Optional arrays with null count 0
> > > >> >
> > > >> > My thoughts are that "required-ness" would best be minded at
the
> > > >> > metadata / schema level, rather than tasking the lowest tier
of
> data
> > > >> > structures and algorithms with handling the two semantically
> > distinct,
> > > >> > but functionally equivalent forms of data without nulls. When
> > > >> > performing analytics, it adds complexity as some operations may
> > > >> > introduce or remove nulls, which would require type metadata
to be
> > > >> > massaged i.e.:
> > > >> >
> > > >> > function(required input) -> optional output, versus
> > > >> >
> > > >> > function(input [null_count == 0]) -> output [maybe null_count
>
> 0].
> > > >> >
> > > >> > In the latter case, algorithms set bits and track the number
of
> > nulls
> > > >> > while constructing the output Arrow array; the former adds some
> > extra
> > > >> > complexity.
> > > >> >
> > > >> > The question of course, is where to enforce "required" in data
> > > >> > interchange. If two systems have agreed (through exchange of
> > > >> > schemas/metadata) that a particular batch of Arrow data is
> > > >> > non-nullable, I would suggest that the null_count == 0 contract
be
> > > >> > validated at that point.
> > > >> >
> > > >> > Curious to hear others' thoughts on this, and please let me know
> if
> > I
> > > >> > can clarify anything I've said here.
> > > >> >
> > > >> > best,
> > > >> > Wes
> > > >> >
> > > >>
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message