arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <...@cloudera.com>
Subject Re: Should Nullable be a nested type?
Date Fri, 26 Feb 2016 01:29:42 GMT
hi Paul,

responses inline.

On Thu, Feb 25, 2016 at 5:05 PM, Paul Weiss <paulweiss.dev@gmail.com> wrote:
> Hi,
>
> For what it's worth my thoughts are as follows... Mostly from being burned
> BTW
>
> * for doubles and ints use a min value,  in Java I use Integer.MIN_VALUE
> and the double equivalent. It is reasonable to assume a program will have
> one "special"  value instead of null

This is slightly off topic from this discussion (which is about the
user API for interacting with Arrow data).

However: Since we are intending Arrow for full-fidelity data
interchange between system runtimes, and as a container for data from
serialization formats like Parquet and Avro, special values for nulls
is not an option available to us (i.e. the file formats and storage
systems such as Kudu may have data spanning the full range of values
within numeric types).

Columnar database systems have generally used bit- or byte-arrays as a
universal way to encode null for flat and nested types alike. In order
to enable vectorized / SIMD processing of the bitmap data it is best
to store the null bits/bytes contiguously and separate from the rest
of the array data.

> * for collections or vectors just create an empty one.   Having code that
> is littered with null checks is always error prone.
>

Parquet and other serialization formats do distinguish between null
collections and empty collections, so we must preserve that
distinction in Arrow as well.

- Wes

> The memory layout of a specific type cannot be different if you have arrays
> and nested structures.   Unless you add a lot of special handling which
> leads to performance degradation.
>
> Another approach is to allow the user to define what null represents while
> staying within the bounds of keeping the same type.   Basically you do away
> with nulls altogether and define what null means by picking a value at
> runtime or even compile time.  It simplifies serialization.
>
> -paul
> On Feb 25, 2016 7:23 PM, "Daniel Robinson" <danrobinson010@gmail.com> wrote:
>
>> Hi Wes,
>>
>> Thanks for the response!
>>
>> I see the appeal of representing all arrays at the data structure level as
>> "nullable," but I think I've convinced myself that it's better to represent
>> them all as non-nullable, and bolt on "nullability" as a wrapper type. In
>> addition to the reasons I mentioned yesterday (such as localizing logic for
>> dealing with null values in its own class), I think there's a pretty strong
>> efficiency case for it.  If you don't distinguish between nullable and
>> non-nullable data structures at the data structure level, you're
>> essentially hiding useful optimization information from the compiler.
>>
>> For example, arrays should probably have a simple API for getting a value
>> from a specified slot number, something like get(int32_t offset). This
>> makes it easier to write algorithms that don't have to worry about the
>> underlying memory structure of the array. If all arrays at the data
>> structure level are considered "nullable", this code would have to do an
>> "if (null_count == 0)" check every time get() is called, which is
>> inefficient and somewhat inelegant.
>>
>> I made a demo showing how I'd envision implementing nullable arrays in C++
>> with nested types, and how algorithms on them would work:
>> https://github.com/danrobinson/arrow-demo
>>
>> Right now it's just nullable and primitive types and arrays, but I'll try
>> adding list types and builders to see how it compares to the
>> string-handling code you linked to.
>>
>> One key advantage I see of making Nullable<T> its own nested type is that
>> the algorithm code can make full use of C++'s template specialization. For
>> example, here's how I define the versions of the Mean algorithm that work
>> on nullable arrays and those that work on non-nullable arrays (see
>> https://github.com/danrobinson/arrow-demo/blob/master/algorithms.h):
>>
>> // happens when you call Mean() on a non-nullable array
>> template<typename T>
>> double Mean(Array<T>& arr) {
>>   return (double)Sum(arr) / (arr.length());
>> }
>>
>> // happens when you call Mean() on a nullable array
>> template<typename T>
>> double Mean(Array< Nullable<T> >& nullableArr) {
>>   return (double)Sum(nullableArr) / (nullableArr.length() -
>> nullableArr.null_count());
>> }
>>
>> This is only possible if nullability is included in the
>> data-structure-level type information.
>>
>> As you suggested, the code also uses a shortcut when Sum() is called on a
>> "nullable" array that has null_count of 0: it just calls Sum() on its child
>> array (see line 24). I think this, too, is simpler if we aren't cramming
>> nullability logic into every array class.
>>
>> There are a few other tricks that could be used whether or not you treat
>> Nullable as a separate type. For example, we can implement an overloaded
>> get() function that takes, in addition to a slot offset, a default value
>> that should be returned if that slot is null (see
>> https://github.com/danrobinson/arrow-demo/blob/master/arrays.h#L114).
>> This
>> let me write the main code for the Sum() algorithm only once, since for its
>> purposes, 0 is equivalent to null. (It would also presumably work with
>> Product() and more importantly sorting).
>>
>> Again, would appreciate any thoughts.
>>
>> I definitely agree that the interface for users should be the priority
>> (after the memory format). My own motivation for thinking about this is my
>> interest in making easy APIs for creating or accessing these memory
>> structures in higher-level dynamically typed languages.
>>
>> On Wed, Feb 24, 2016 at 6:36 PM, Wes McKinney <wes@cloudera.com> wrote:
>>
>> > hi Dan,
>> >
>> > Thanks for these thoughts! There's a few separate problems
>> >
>> > - Physical memory representation
>> > - Metadata
>> > - Implementation container type layering / user API
>> >
>> > We aren't expecting any system that uses Arrow to necessarily use one
>> > of the reference Arrow implementations, but I expect an ecosystem of
>> > tools will develop around the reference implementations (e.g. Parquet
>> > or Avro adapters).
>> >
>> > In the thread I started last week, I proposed striking this
>> > distinction and instead relegating REQUIRED (non-nullable) and
>> > OPTIONAL (nullable) to the domain of metadata. In my opinion, having
>> > nullability in the container types at all adds much code complexity,
>> > and the two cases of
>> >
>> > - null count == 0
>> > - non-nullable
>> >
>> > Must be treated as distinct, but in general algorithmically equivalent
>> > (e.g. the bitmap need not be allocated, and would be ignored even if
>> > so in most analytics code requiring knowledge of null-ness).
>> >
>> > In trying to come up with functional type systems for representing the
>> > semantics of the data, I find it is most useful to start from the
>> > desired user API (which sufficiently captures the nuances of the
>> > physical memory layout) and work backwards to an implementation.
>> >
>> > I would be interested to see some rough C++ / pseudocode to help
>> > illustrate how this would play out in downstream user code using
>> > libarrow and its headers, and internal algorithms in that operate on
>> > arrow Array instances. I've seen many functional programmers work on
>> > these data structure + type system problems and end up chasing their
>> > tails for a very long time. On the C++ side, that code you see in
>> > apache/arrow is a limited prototype that I haven't used to build any
>> > applications yet, so it'd be great to iterate on some working
>> > prototypes of other layerings. For example, we could work on:
>> >
>> >
>> >
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199
>> >
>> > As pseudocode, suppose we had a function:
>> >
>> > double Sum(DoubleArray* arr) {
>> >   double total = 0;
>> >   double* values = arr->values();
>> >   if (arr->null_count() == 0) {
>> >     for (int i = 0; i < arr->length(); ++i) {
>> >       total += values[i];
>> >     }
>> >   } else {
>> >     for (int i = 0; i < arr->length(); ++i) {
>> >       if (arr->NotNull(i)) {
>> >         total += values[i];
>> >       }
>> >     }
>> >   }
>> >   return total;
>> > }
>> >
>> > If you had instead NullableDoubleArray and DoubleArray, you might want
>> > to generate different functions. This would likely be true for any
>> > algorithm where nulls are semantically meaningful.
>> >
>> > For me at least, the most important aspect of Arrow is the memory
>> > layout. The purpose of the implementations is to provide a reference
>> > user API to make it as convenient and fast as possible to create
>> > Arrow-layout data, manipulate data in-memory, and send/receive data
>> > via IPC/shared-memory.
>> >
>> > - Wes
>> >
>> > On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson
>> > <danrobinson010@gmail.com> wrote:
>> > > (The below mostly uses terminology and examples from the draft spec
>> > > <https://github.com/apache/arrow/blob/master/format/Layout.md> and
C++
>> > > implementation <https://github.com/apache/arrow/tree/master/cpp>).
>> > >
>> > > Under the current spec, if I understand it correctly, there are two
>> > > versions of every type: a nullable version and a non-nullable version.
>> > > Whether a type T is nullable is specified by an attribute on that type,
>> > so
>> > > declarations look something like T [nullable]:
>> > >
>> > > Int32 [non-nullable]
>> > >
>> > > Int32 [nullable]
>> > >
>> > >
>> > > In contrast, a variable-length version of type T is declared using a
>> > nested
>> > > or parametric type, List<T>.
>> > >
>> > >
>> > > List [non-nullable] <Int32 [non-nullable]>
>> > > List [nullable] <Int32 [non-nullable]>
>> > > List [non-nullable] <Int32 [nullable]>
>> > > List [nullable] <Int32 [nullable] >
>> > >
>> > > Consistent with this, might it make sense to use a nested type,
>> > > *Nullable<T>*, to declare a nullable version of that type? A slot
in an
>> > > array of type Nullable<T> would be able to contain either a null
value
>> or
>> > > any value of type T. Using this system, the above examples would look
>> > like:
>> > >
>> > > Int32
>> > > Nullable<Int32>
>> > > List<Int32>
>> > >
>> > > Nullable<List<Int32>>
>> > > List<Nullable<Int32>>
>> > > Nullable<List<Nullable<Int32>>>>
>> > >
>> > >
>> > > This could also be paralleled by a tweak to implementations (which
>> > wouldn't
>> > > change how the data is stored in physical memory). Right now, in the
>> C++
>> > > implementation, any array might be nullable, so the base Array class
>> > > contains the logic necessary to deal with nullable arrays, including
>> some
>> > > potentially unnecessary branching.
>> > >
>> > > Instead, much like how an array of type List<T> is instantiated as
a
>> > > ListArray
>> > > <
>> >
>> https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h
>> > >
>> > >  that contains its own int32 array of offsets and a reference to a
>> child
>> > > "values" array of type T
>> > > <https://github.com/apache/arrow/blob/master/format/Layout.md>, every
>> > array
>> > > of type Nullable<T> could be instantiated as a NullableArray containing
>> > its
>> > > own bitmask and a reference to a child array of type T. (It could also
>> > keep
>> > > track of the null_count, if that idea is implemented.)
>> > >
>> > > The child array would not have to know that it is wrapped by a
>> > > NullableArray, and we could abstract almost all of the logic for
>> dealing
>> > > with null values out of the Array and ArrayBuilder classes.
>> (ArrayBuilder
>> > > would need a method for incrementing an array's length without adding a
>> > > particular value, but it already needs that to build a child array in a
>> > Sparse
>> > > Union
>> > > <
>> >
>> https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png
>> > >
>> > > ).
>> > >
>> > > I see a few potential additional advantages to this approach:
>> > >
>> > >    - It would effectively halve the number of distinct types.
>> > >    - As mentioned above, it would be consistent with how the Arrow spec
>> > >    treats List types and other nested types. Treating "nullability" as
>> a
>> > >    special attribute on each type seems more consistent with how
>> > something
>> > >    like Dremel treats both "optional" and "repeated" fields.
>> > >    - It would also be consistent with how "option types
>> > >    <https://en.wikipedia.org/wiki/Option_type>" are conventionally
>> > declared
>> > >    in other languages. For example, C# has Nullable<T>
>> > >    <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia
has
>> > >    Nullable{T}
>> > >    <
>> >
>> http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values
>> > >,
>> > > Java
>> > >    has Optional<T>
>> > >    <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html
>> >,
>> > > and Haskell
>> > >    has Maybe a
>> > >    <
>> > https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
>> > >    - The implementation would be similar to a special case of
>> > SparseUnion,
>> > >    which makes sense, since type Nullable<T> is equivalent to T ∪
>> > NullType,
>> > >    where NullType is a type that can only hold the null value.
>> > >    - If the spec ever adds a Masking<T> type that allows a selection
>> of a
>> > >    vector using a bitmask, I imagine it, too, would be implemented
>> > similarly.
>> > >    - Parsing a type declaration as a tree would be simpler.
>> > >    - Declarations of non-nullable types would be shorter, and
>> > non-nullable
>> > >    types would be the default. I think this makes sense, since it means
>> > types
>> > >    such as Int32 correspond exactly to their C-type, and people would
>> be
>> > less
>> > >    likely to be surprised by unexpected null values at runtime.
>> > Additionally,
>> > >    given the overhead (in speed, memory, complexity, and safety) of
>> > nullable
>> > >    types, I think programmers should have to explicitly declare them.
>> > >
>> > > I'd appreciate any thoughts, especially from those with more
>> involvement
>> > in
>> > > the project or experience with related projects.
>> > >
>> > > One more thought (which could apply even if nullability is kept as a
>> > > special attribute). To make the string representation of nullable types
>> > > more readable, it might be worth considering a convention from C#
>> > > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift
>> > > <http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/
>> >,
>> > > Dremel
>> > > <
>> >
>> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
>> > >,
>> > > and Flow <http://flowtype.org/docs/nullable-types.html>: using either
>> > T? or
>> > > ?T to mean Nullable<T>.
>> > >
>> > > Int32
>> > > ?Int32
>> > > List<Int32>
>> > > ?List<Int32>
>> > > List<?Int32>
>> > > ?List<?Int32>
>> > >
>> > > Int32
>> > > Int32?
>> > > List<Int32>
>> > > List?<Int32>
>> > > List<Int32?>
>> > > List?<Int32?>
>> >
>> > My inclination is still to remove nullable from the C++ array
>> > container types, but to keep track of the required (non-nullable) /
>> > optional (nullable) bit in the type metadata so that it can be kept
>> > track of in IPC / schema exchange with other systems as necessary.
>> >
>>

Mime
View raw message