arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Robinson <danrobinson...@gmail.com>
Subject Re: Should Nullable be a nested type?
Date Fri, 26 Feb 2016 00:23:18 GMT
Hi Wes,

Thanks for the response!

I see the appeal of representing all arrays at the data structure level as
"nullable," but I think I've convinced myself that it's better to represent
them all as non-nullable, and bolt on "nullability" as a wrapper type. In
addition to the reasons I mentioned yesterday (such as localizing logic for
dealing with null values in its own class), I think there's a pretty strong
efficiency case for it.  If you don't distinguish between nullable and
non-nullable data structures at the data structure level, you're
essentially hiding useful optimization information from the compiler.

For example, arrays should probably have a simple API for getting a value
from a specified slot number, something like get(int32_t offset). This
makes it easier to write algorithms that don't have to worry about the
underlying memory structure of the array. If all arrays at the data
structure level are considered "nullable", this code would have to do an
"if (null_count == 0)" check every time get() is called, which is
inefficient and somewhat inelegant.

I made a demo showing how I'd envision implementing nullable arrays in C++
with nested types, and how algorithms on them would work:
https://github.com/danrobinson/arrow-demo

Right now it's just nullable and primitive types and arrays, but I'll try
adding list types and builders to see how it compares to the
string-handling code you linked to.

One key advantage I see of making Nullable<T> its own nested type is that
the algorithm code can make full use of C++'s template specialization. For
example, here's how I define the versions of the Mean algorithm that work
on nullable arrays and those that work on non-nullable arrays (see
https://github.com/danrobinson/arrow-demo/blob/master/algorithms.h):

// happens when you call Mean() on a non-nullable array
template<typename T>
double Mean(Array<T>& arr) {
  return (double)Sum(arr) / (arr.length());
}

// happens when you call Mean() on a nullable array
template<typename T>
double Mean(Array< Nullable<T> >& nullableArr) {
  return (double)Sum(nullableArr) / (nullableArr.length() -
nullableArr.null_count());
}

This is only possible if nullability is included in the
data-structure-level type information.

As you suggested, the code also uses a shortcut when Sum() is called on a
"nullable" array that has null_count of 0: it just calls Sum() on its child
array (see line 24). I think this, too, is simpler if we aren't cramming
nullability logic into every array class.

There are a few other tricks that could be used whether or not you treat
Nullable as a separate type. For example, we can implement an overloaded
get() function that takes, in addition to a slot offset, a default value
that should be returned if that slot is null (see
https://github.com/danrobinson/arrow-demo/blob/master/arrays.h#L114).  This
let me write the main code for the Sum() algorithm only once, since for its
purposes, 0 is equivalent to null. (It would also presumably work with
Product() and more importantly sorting).

Again, would appreciate any thoughts.

I definitely agree that the interface for users should be the priority
(after the memory format). My own motivation for thinking about this is my
interest in making easy APIs for creating or accessing these memory
structures in higher-level dynamically typed languages.

On Wed, Feb 24, 2016 at 6:36 PM, Wes McKinney <wes@cloudera.com> wrote:

> hi Dan,
>
> Thanks for these thoughts! There's a few separate problems
>
> - Physical memory representation
> - Metadata
> - Implementation container type layering / user API
>
> We aren't expecting any system that uses Arrow to necessarily use one
> of the reference Arrow implementations, but I expect an ecosystem of
> tools will develop around the reference implementations (e.g. Parquet
> or Avro adapters).
>
> In the thread I started last week, I proposed striking this
> distinction and instead relegating REQUIRED (non-nullable) and
> OPTIONAL (nullable) to the domain of metadata. In my opinion, having
> nullability in the container types at all adds much code complexity,
> and the two cases of
>
> - null count == 0
> - non-nullable
>
> Must be treated as distinct, but in general algorithmically equivalent
> (e.g. the bitmap need not be allocated, and would be ignored even if
> so in most analytics code requiring knowledge of null-ness).
>
> In trying to come up with functional type systems for representing the
> semantics of the data, I find it is most useful to start from the
> desired user API (which sufficiently captures the nuances of the
> physical memory layout) and work backwards to an implementation.
>
> I would be interested to see some rough C++ / pseudocode to help
> illustrate how this would play out in downstream user code using
> libarrow and its headers, and internal algorithms in that operate on
> arrow Array instances. I've seen many functional programmers work on
> these data structure + type system problems and end up chasing their
> tails for a very long time. On the C++ side, that code you see in
> apache/arrow is a limited prototype that I haven't used to build any
> applications yet, so it'd be great to iterate on some working
> prototypes of other layerings. For example, we could work on:
>
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199
>
> As pseudocode, suppose we had a function:
>
> double Sum(DoubleArray* arr) {
>   double total = 0;
>   double* values = arr->values();
>   if (arr->null_count() == 0) {
>     for (int i = 0; i < arr->length(); ++i) {
>       total += values[i];
>     }
>   } else {
>     for (int i = 0; i < arr->length(); ++i) {
>       if (arr->NotNull(i)) {
>         total += values[i];
>       }
>     }
>   }
>   return total;
> }
>
> If you had instead NullableDoubleArray and DoubleArray, you might want
> to generate different functions. This would likely be true for any
> algorithm where nulls are semantically meaningful.
>
> For me at least, the most important aspect of Arrow is the memory
> layout. The purpose of the implementations is to provide a reference
> user API to make it as convenient and fast as possible to create
> Arrow-layout data, manipulate data in-memory, and send/receive data
> via IPC/shared-memory.
>
> - Wes
>
> On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson
> <danrobinson010@gmail.com> wrote:
> > (The below mostly uses terminology and examples from the draft spec
> > <https://github.com/apache/arrow/blob/master/format/Layout.md> and C++
> > implementation <https://github.com/apache/arrow/tree/master/cpp>).
> >
> > Under the current spec, if I understand it correctly, there are two
> > versions of every type: a nullable version and a non-nullable version.
> > Whether a type T is nullable is specified by an attribute on that type,
> so
> > declarations look something like T [nullable]:
> >
> > Int32 [non-nullable]
> >
> > Int32 [nullable]
> >
> >
> > In contrast, a variable-length version of type T is declared using a
> nested
> > or parametric type, List<T>.
> >
> >
> > List [non-nullable] <Int32 [non-nullable]>
> > List [nullable] <Int32 [non-nullable]>
> > List [non-nullable] <Int32 [nullable]>
> > List [nullable] <Int32 [nullable] >
> >
> > Consistent with this, might it make sense to use a nested type,
> > *Nullable<T>*, to declare a nullable version of that type? A slot in an
> > array of type Nullable<T> would be able to contain either a null value or
> > any value of type T. Using this system, the above examples would look
> like:
> >
> > Int32
> > Nullable<Int32>
> > List<Int32>
> >
> > Nullable<List<Int32>>
> > List<Nullable<Int32>>
> > Nullable<List<Nullable<Int32>>>>
> >
> >
> > This could also be paralleled by a tweak to implementations (which
> wouldn't
> > change how the data is stored in physical memory). Right now, in the C++
> > implementation, any array might be nullable, so the base Array class
> > contains the logic necessary to deal with nullable arrays, including some
> > potentially unnecessary branching.
> >
> > Instead, much like how an array of type List<T> is instantiated as a
> > ListArray
> > <
> https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h
> >
> >  that contains its own int32 array of offsets and a reference to a child
> > "values" array of type T
> > <https://github.com/apache/arrow/blob/master/format/Layout.md>, every
> array
> > of type Nullable<T> could be instantiated as a NullableArray containing
> its
> > own bitmask and a reference to a child array of type T. (It could also
> keep
> > track of the null_count, if that idea is implemented.)
> >
> > The child array would not have to know that it is wrapped by a
> > NullableArray, and we could abstract almost all of the logic for dealing
> > with null values out of the Array and ArrayBuilder classes. (ArrayBuilder
> > would need a method for incrementing an array's length without adding a
> > particular value, but it already needs that to build a child array in a
> Sparse
> > Union
> > <
> https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png
> >
> > ).
> >
> > I see a few potential additional advantages to this approach:
> >
> >    - It would effectively halve the number of distinct types.
> >    - As mentioned above, it would be consistent with how the Arrow spec
> >    treats List types and other nested types. Treating "nullability" as a
> >    special attribute on each type seems more consistent with how
> something
> >    like Dremel treats both "optional" and "repeated" fields.
> >    - It would also be consistent with how "option types
> >    <https://en.wikipedia.org/wiki/Option_type>" are conventionally
> declared
> >    in other languages. For example, C# has Nullable<T>
> >    <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia has
> >    Nullable{T}
> >    <
> http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values
> >,
> > Java
> >    has Optional<T>
> >    <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html>,
> > and Haskell
> >    has Maybe a
> >    <
> https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
> >    - The implementation would be similar to a special case of
> SparseUnion,
> >    which makes sense, since type Nullable<T> is equivalent to T ∪
> NullType,
> >    where NullType is a type that can only hold the null value.
> >    - If the spec ever adds a Masking<T> type that allows a selection of a
> >    vector using a bitmask, I imagine it, too, would be implemented
> similarly.
> >    - Parsing a type declaration as a tree would be simpler.
> >    - Declarations of non-nullable types would be shorter, and
> non-nullable
> >    types would be the default. I think this makes sense, since it means
> types
> >    such as Int32 correspond exactly to their C-type, and people would be
> less
> >    likely to be surprised by unexpected null values at runtime.
> Additionally,
> >    given the overhead (in speed, memory, complexity, and safety) of
> nullable
> >    types, I think programmers should have to explicitly declare them.
> >
> > I'd appreciate any thoughts, especially from those with more involvement
> in
> > the project or experience with related projects.
> >
> > One more thought (which could apply even if nullability is kept as a
> > special attribute). To make the string representation of nullable types
> > more readable, it might be worth considering a convention from C#
> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift
> > <http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>,
> > Dremel
> > <
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
> >,
> > and Flow <http://flowtype.org/docs/nullable-types.html>: using either
> T? or
> > ?T to mean Nullable<T>.
> >
> > Int32
> > ?Int32
> > List<Int32>
> > ?List<Int32>
> > List<?Int32>
> > ?List<?Int32>
> >
> > Int32
> > Int32?
> > List<Int32>
> > List?<Int32>
> > List<Int32?>
> > List?<Int32?>
>
> My inclination is still to remove nullable from the C++ array
> container types, but to keep track of the required (non-nullable) /
> optional (nullable) bit in the type metadata so that it can be kept
> track of in IPC / schema exchange with other systems as necessary.
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message