arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <...@cloudera.com>
Subject Re: Should Nullable be a nested type?
Date Wed, 24 Feb 2016 23:36:26 GMT
hi Dan,

Thanks for these thoughts! There's a few separate problems

- Physical memory representation
- Metadata
- Implementation container type layering / user API

We aren't expecting any system that uses Arrow to necessarily use one
of the reference Arrow implementations, but I expect an ecosystem of
tools will develop around the reference implementations (e.g. Parquet
or Avro adapters).

In the thread I started last week, I proposed striking this
distinction and instead relegating REQUIRED (non-nullable) and
OPTIONAL (nullable) to the domain of metadata. In my opinion, having
nullability in the container types at all adds much code complexity,
and the two cases of

- null count == 0
- non-nullable

Must be treated as distinct, but in general algorithmically equivalent
(e.g. the bitmap need not be allocated, and would be ignored even if
so in most analytics code requiring knowledge of null-ness).

In trying to come up with functional type systems for representing the
semantics of the data, I find it is most useful to start from the
desired user API (which sufficiently captures the nuances of the
physical memory layout) and work backwards to an implementation.

I would be interested to see some rough C++ / pseudocode to help
illustrate how this would play out in downstream user code using
libarrow and its headers, and internal algorithms in that operate on
arrow Array instances. I've seen many functional programmers work on
these data structure + type system problems and end up chasing their
tails for a very long time. On the C++ side, that code you see in
apache/arrow is a limited prototype that I haven't used to build any
applications yet, so it'd be great to iterate on some working
prototypes of other layerings. For example, we could work on:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199

As pseudocode, suppose we had a function:

double Sum(DoubleArray* arr) {
  double total = 0;
  double* values = arr->values();
  if (arr->null_count() == 0) {
    for (int i = 0; i < arr->length(); ++i) {
      total += values[i];
    }
  } else {
    for (int i = 0; i < arr->length(); ++i) {
      if (arr->NotNull(i)) {
        total += values[i];
      }
    }
  }
  return total;
}

If you had instead NullableDoubleArray and DoubleArray, you might want
to generate different functions. This would likely be true for any
algorithm where nulls are semantically meaningful.

For me at least, the most important aspect of Arrow is the memory
layout. The purpose of the implementations is to provide a reference
user API to make it as convenient and fast as possible to create
Arrow-layout data, manipulate data in-memory, and send/receive data
via IPC/shared-memory.

- Wes

On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson
<danrobinson010@gmail.com> wrote:
> (The below mostly uses terminology and examples from the draft spec
> <https://github.com/apache/arrow/blob/master/format/Layout.md> and C++
> implementation <https://github.com/apache/arrow/tree/master/cpp>).
>
> Under the current spec, if I understand it correctly, there are two
> versions of every type: a nullable version and a non-nullable version.
> Whether a type T is nullable is specified by an attribute on that type, so
> declarations look something like T [nullable]:
>
> Int32 [non-nullable]
>
> Int32 [nullable]
>
>
> In contrast, a variable-length version of type T is declared using a nested
> or parametric type, List<T>.
>
>
> List [non-nullable] <Int32 [non-nullable]>
> List [nullable] <Int32 [non-nullable]>
> List [non-nullable] <Int32 [nullable]>
> List [nullable] <Int32 [nullable] >
>
> Consistent with this, might it make sense to use a nested type,
> *Nullable<T>*, to declare a nullable version of that type? A slot in an
> array of type Nullable<T> would be able to contain either a null value or
> any value of type T. Using this system, the above examples would look like:
>
> Int32
> Nullable<Int32>
> List<Int32>
>
> Nullable<List<Int32>>
> List<Nullable<Int32>>
> Nullable<List<Nullable<Int32>>>>
>
>
> This could also be paralleled by a tweak to implementations (which wouldn't
> change how the data is stored in physical memory). Right now, in the C++
> implementation, any array might be nullable, so the base Array class
> contains the logic necessary to deal with nullable arrays, including some
> potentially unnecessary branching.
>
> Instead, much like how an array of type List<T> is instantiated as a
> ListArray
> <https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h>
>  that contains its own int32 array of offsets and a reference to a child
> "values" array of type T
> <https://github.com/apache/arrow/blob/master/format/Layout.md>, every array
> of type Nullable<T> could be instantiated as a NullableArray containing its
> own bitmask and a reference to a child array of type T. (It could also keep
> track of the null_count, if that idea is implemented.)
>
> The child array would not have to know that it is wrapped by a
> NullableArray, and we could abstract almost all of the logic for dealing
> with null values out of the Array and ArrayBuilder classes. (ArrayBuilder
> would need a method for incrementing an array's length without adding a
> particular value, but it already needs that to build a child array in a Sparse
> Union
> <https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png>
> ).
>
> I see a few potential additional advantages to this approach:
>
>    - It would effectively halve the number of distinct types.
>    - As mentioned above, it would be consistent with how the Arrow spec
>    treats List types and other nested types. Treating "nullability" as a
>    special attribute on each type seems more consistent with how something
>    like Dremel treats both "optional" and "repeated" fields.
>    - It would also be consistent with how "option types
>    <https://en.wikipedia.org/wiki/Option_type>" are conventionally declared
>    in other languages. For example, C# has Nullable<T>
>    <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia has
>    Nullable{T}
>    <http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values>,
> Java
>    has Optional<T>
>    <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html>,
> and Haskell
>    has Maybe a
>    <https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
>    - The implementation would be similar to a special case of SparseUnion,
>    which makes sense, since type Nullable<T> is equivalent to T ∪ NullType,
>    where NullType is a type that can only hold the null value.
>    - If the spec ever adds a Masking<T> type that allows a selection of a
>    vector using a bitmask, I imagine it, too, would be implemented similarly.
>    - Parsing a type declaration as a tree would be simpler.
>    - Declarations of non-nullable types would be shorter, and non-nullable
>    types would be the default. I think this makes sense, since it means types
>    such as Int32 correspond exactly to their C-type, and people would be less
>    likely to be surprised by unexpected null values at runtime. Additionally,
>    given the overhead (in speed, memory, complexity, and safety) of nullable
>    types, I think programmers should have to explicitly declare them.
>
> I'd appreciate any thoughts, especially from those with more involvement in
> the project or experience with related projects.
>
> One more thought (which could apply even if nullability is kept as a
> special attribute). To make the string representation of nullable types
> more readable, it might be worth considering a convention from C#
> <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift
> <http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>,
> Dremel
> <http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf>,
> and Flow <http://flowtype.org/docs/nullable-types.html>: using either T? or
> ?T to mean Nullable<T>.
>
> Int32
> ?Int32
> List<Int32>
> ?List<Int32>
> List<?Int32>
> ?List<?Int32>
>
> Int32
> Int32?
> List<Int32>
> List?<Int32>
> List<Int32?>
> List?<Int32?>

My inclination is still to remove nullable from the C++ array
container types, but to keep track of the required (non-nullable) /
optional (nullable) bit in the type metadata so that it can be kept
track of in IPC / schema exchange with other systems as necessary.

Mime
View raw message