arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <...@cloudera.com>
Subject Re: Should Nullable be a nested type?
Date Fri, 26 Feb 2016 02:17:08 GMT
Inline responses

On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson
<danrobinson010@gmail.com> wrote:
> Hi Wes,
>
> Thanks for the response!
>
> I see the appeal of representing all arrays at the data structure level as
> "nullable," but I think I've convinced myself that it's better to represent
> them all as non-nullable, and bolt on "nullability" as a wrapper type. In
> addition to the reasons I mentioned yesterday (such as localizing logic for
> dealing with null values in its own class), I think there's a pretty strong
> efficiency case for it.  If you don't distinguish between nullable and
> non-nullable data structures at the data structure level, you're
> essentially hiding useful optimization information from the compiler.
>

I question whether this optimization information is useful given many
of the target workloads (e.g. scan-based analytics). On one code
branch (if null_count > 0), a bitmap is examined either 1-bit at a
time or using popcount (1 word's worth of values at a time -- this
could improve CPU instruction pipelining), in the other branch the
bitmap is ignored.

> For example, arrays should probably have a simple API for getting a value
> from a specified slot number, something like get(int32_t offset). This
> makes it easier to write algorithms that don't have to worry about the
> underlying memory structure of the array. If all arrays at the data
> structure level are considered "nullable", this code would have to do an
> "if (null_count == 0)" check every time get() is called, which is
> inefficient and somewhat inelegant.
>

I have been expecting that any user of an Arrow C/C++ library would be
expected to understand the physical memory layout; with functions to
access values made available as a convenience. So I would not want you
to call get(i) unless you know that it is not null and not out of
bounds.

> I made a demo showing how I'd envision implementing nullable arrays in C++
> with nested types, and how algorithms on them would work:
> https://github.com/danrobinson/arrow-demo
>
> Right now it's just nullable and primitive types and arrays, but I'll try
> adding list types and builders to see how it compares to the
> string-handling code you linked to.
>
> One key advantage I see of making Nullable<T> its own nested type is that
> the algorithm code can make full use of C++'s template specialization. For
> example, here's how I define the versions of the Mean algorithm that work
> on nullable arrays and those that work on non-nullable arrays (see
> https://github.com/danrobinson/arrow-demo/blob/master/algorithms.h):
>
> // happens when you call Mean() on a non-nullable array
> template<typename T>
> double Mean(Array<T>& arr) {
>   return (double)Sum(arr) / (arr.length());
> }
>
> // happens when you call Mean() on a nullable array
> template<typename T>
> double Mean(Array< Nullable<T> >& nullableArr) {
>   return (double)Sum(nullableArr) / (nullableArr.length() -
> nullableArr.null_count());
> }
>
> This is only possible if nullability is included in the
> data-structure-level type information.
>
> As you suggested, the code also uses a shortcut when Sum() is called on a
> "nullable" array that has null_count of 0: it just calls Sum() on its child
> array (see line 24). I think this, too, is simpler if we aren't cramming
> nullability logic into every array class.
>
> There are a few other tricks that could be used whether or not you treat
> Nullable as a separate type. For example, we can implement an overloaded
> get() function that takes, in addition to a slot offset, a default value
> that should be returned if that slot is null (see
> https://github.com/danrobinson/arrow-demo/blob/master/arrays.h#L114).  This
> let me write the main code for the Sum() algorithm only once, since for its
> purposes, 0 is equivalent to null. (It would also presumably work with
> Product() and more importantly sorting).
>

Cool, thanks for making a prototype! Working code is worth 1000 words =)

I'm not convinced by the performance argument. For example, this
snippet has multiple if-then-else branches (which could possibly throw
exceptions) and so would not be advisable to put in the inner loop of
algorithms:

value_type get(int32_t slotNumber, value_type null_value) {
  check_bounds(slotNumber);
  return safe_get(slotNumber, null_value);
}

"in time critical code, throwing an exception should be the exception,
not the rule."
http://www.boost.org/community/error_handling.html

> Again, would appreciate any thoughts.
>
> I definitely agree that the interface for users should be the priority
> (after the memory format). My own motivation for thinking about this is my
> interest in making easy APIs for creating or accessing these memory
> structures in higher-level dynamically typed languages.
>

I agree that making Arrow data structures accessible and easy-to-use
in dynamic languages is desirable. I don't think that adding such
overhead to reference C++ algorithms (which should probably be as
close-to-the-metal and performance-oriented at possible, and expect
users to use them responsibly) is a good idea, since usability /
accessibility generally comes at the expense of CPU cycles and cache
efficiency.

For example, it would be straightforward to build "airbags", so to
speak, in a Python wrapper C extension, since performance is less
critical in user APIs oriented at accessibility and safety /
error-handling. In dynamic languages this is especially important as
bugs in e.g. Python user code should never result in segfaults/core
dumps.

cheers,
Wes

> On Wed, Feb 24, 2016 at 6:36 PM, Wes McKinney <wes@cloudera.com> wrote:
>
>> hi Dan,
>>
>> Thanks for these thoughts! There's a few separate problems
>>
>> - Physical memory representation
>> - Metadata
>> - Implementation container type layering / user API
>>
>> We aren't expecting any system that uses Arrow to necessarily use one
>> of the reference Arrow implementations, but I expect an ecosystem of
>> tools will develop around the reference implementations (e.g. Parquet
>> or Avro adapters).
>>
>> In the thread I started last week, I proposed striking this
>> distinction and instead relegating REQUIRED (non-nullable) and
>> OPTIONAL (nullable) to the domain of metadata. In my opinion, having
>> nullability in the container types at all adds much code complexity,
>> and the two cases of
>>
>> - null count == 0
>> - non-nullable
>>
>> Must be treated as distinct, but in general algorithmically equivalent
>> (e.g. the bitmap need not be allocated, and would be ignored even if
>> so in most analytics code requiring knowledge of null-ness).
>>
>> In trying to come up with functional type systems for representing the
>> semantics of the data, I find it is most useful to start from the
>> desired user API (which sufficiently captures the nuances of the
>> physical memory layout) and work backwards to an implementation.
>>
>> I would be interested to see some rough C++ / pseudocode to help
>> illustrate how this would play out in downstream user code using
>> libarrow and its headers, and internal algorithms in that operate on
>> arrow Array instances. I've seen many functional programmers work on
>> these data structure + type system problems and end up chasing their
>> tails for a very long time. On the C++ side, that code you see in
>> apache/arrow is a limited prototype that I haven't used to build any
>> applications yet, so it'd be great to iterate on some working
>> prototypes of other layerings. For example, we could work on:
>>
>>
>> https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199
>>
>> As pseudocode, suppose we had a function:
>>
>> double Sum(DoubleArray* arr) {
>>   double total = 0;
>>   double* values = arr->values();
>>   if (arr->null_count() == 0) {
>>     for (int i = 0; i < arr->length(); ++i) {
>>       total += values[i];
>>     }
>>   } else {
>>     for (int i = 0; i < arr->length(); ++i) {
>>       if (arr->NotNull(i)) {
>>         total += values[i];
>>       }
>>     }
>>   }
>>   return total;
>> }
>>
>> If you had instead NullableDoubleArray and DoubleArray, you might want
>> to generate different functions. This would likely be true for any
>> algorithm where nulls are semantically meaningful.
>>
>> For me at least, the most important aspect of Arrow is the memory
>> layout. The purpose of the implementations is to provide a reference
>> user API to make it as convenient and fast as possible to create
>> Arrow-layout data, manipulate data in-memory, and send/receive data
>> via IPC/shared-memory.
>>
>> - Wes
>>
>> On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson
>> <danrobinson010@gmail.com> wrote:
>> > (The below mostly uses terminology and examples from the draft spec
>> > <https://github.com/apache/arrow/blob/master/format/Layout.md> and C++
>> > implementation <https://github.com/apache/arrow/tree/master/cpp>).
>> >
>> > Under the current spec, if I understand it correctly, there are two
>> > versions of every type: a nullable version and a non-nullable version.
>> > Whether a type T is nullable is specified by an attribute on that type,
>> so
>> > declarations look something like T [nullable]:
>> >
>> > Int32 [non-nullable]
>> >
>> > Int32 [nullable]
>> >
>> >
>> > In contrast, a variable-length version of type T is declared using a
>> nested
>> > or parametric type, List<T>.
>> >
>> >
>> > List [non-nullable] <Int32 [non-nullable]>
>> > List [nullable] <Int32 [non-nullable]>
>> > List [non-nullable] <Int32 [nullable]>
>> > List [nullable] <Int32 [nullable] >
>> >
>> > Consistent with this, might it make sense to use a nested type,
>> > *Nullable<T>*, to declare a nullable version of that type? A slot in an
>> > array of type Nullable<T> would be able to contain either a null value
or
>> > any value of type T. Using this system, the above examples would look
>> like:
>> >
>> > Int32
>> > Nullable<Int32>
>> > List<Int32>
>> >
>> > Nullable<List<Int32>>
>> > List<Nullable<Int32>>
>> > Nullable<List<Nullable<Int32>>>>
>> >
>> >
>> > This could also be paralleled by a tweak to implementations (which
>> wouldn't
>> > change how the data is stored in physical memory). Right now, in the C++
>> > implementation, any array might be nullable, so the base Array class
>> > contains the logic necessary to deal with nullable arrays, including some
>> > potentially unnecessary branching.
>> >
>> > Instead, much like how an array of type List<T> is instantiated as a
>> > ListArray
>> > <
>> https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h
>> >
>> >  that contains its own int32 array of offsets and a reference to a child
>> > "values" array of type T
>> > <https://github.com/apache/arrow/blob/master/format/Layout.md>, every
>> array
>> > of type Nullable<T> could be instantiated as a NullableArray containing
>> its
>> > own bitmask and a reference to a child array of type T. (It could also
>> keep
>> > track of the null_count, if that idea is implemented.)
>> >
>> > The child array would not have to know that it is wrapped by a
>> > NullableArray, and we could abstract almost all of the logic for dealing
>> > with null values out of the Array and ArrayBuilder classes. (ArrayBuilder
>> > would need a method for incrementing an array's length without adding a
>> > particular value, but it already needs that to build a child array in a
>> Sparse
>> > Union
>> > <
>> https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png
>> >
>> > ).
>> >
>> > I see a few potential additional advantages to this approach:
>> >
>> >    - It would effectively halve the number of distinct types.
>> >    - As mentioned above, it would be consistent with how the Arrow spec
>> >    treats List types and other nested types. Treating "nullability" as a
>> >    special attribute on each type seems more consistent with how
>> something
>> >    like Dremel treats both "optional" and "repeated" fields.
>> >    - It would also be consistent with how "option types
>> >    <https://en.wikipedia.org/wiki/Option_type>" are conventionally
>> declared
>> >    in other languages. For example, C# has Nullable<T>
>> >    <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia has
>> >    Nullable{T}
>> >    <
>> http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values
>> >,
>> > Java
>> >    has Optional<T>
>> >    <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html>,
>> > and Haskell
>> >    has Maybe a
>> >    <
>> https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
>> >    - The implementation would be similar to a special case of
>> SparseUnion,
>> >    which makes sense, since type Nullable<T> is equivalent to T ∪
>> NullType,
>> >    where NullType is a type that can only hold the null value.
>> >    - If the spec ever adds a Masking<T> type that allows a selection of
a
>> >    vector using a bitmask, I imagine it, too, would be implemented
>> similarly.
>> >    - Parsing a type declaration as a tree would be simpler.
>> >    - Declarations of non-nullable types would be shorter, and
>> non-nullable
>> >    types would be the default. I think this makes sense, since it means
>> types
>> >    such as Int32 correspond exactly to their C-type, and people would be
>> less
>> >    likely to be surprised by unexpected null values at runtime.
>> Additionally,
>> >    given the overhead (in speed, memory, complexity, and safety) of
>> nullable
>> >    types, I think programmers should have to explicitly declare them.
>> >
>> > I'd appreciate any thoughts, especially from those with more involvement
>> in
>> > the project or experience with related projects.
>> >
>> > One more thought (which could apply even if nullability is kept as a
>> > special attribute). To make the string representation of nullable types
>> > more readable, it might be worth considering a convention from C#
>> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift
>> > <http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>,
>> > Dremel
>> > <
>> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
>> >,
>> > and Flow <http://flowtype.org/docs/nullable-types.html>: using either
>> T? or
>> > ?T to mean Nullable<T>.
>> >
>> > Int32
>> > ?Int32
>> > List<Int32>
>> > ?List<Int32>
>> > List<?Int32>
>> > ?List<?Int32>
>> >
>> > Int32
>> > Int32?
>> > List<Int32>
>> > List?<Int32>
>> > List<Int32?>
>> > List?<Int32?>
>>
>> My inclination is still to remove nullable from the C++ array
>> container types, but to keep track of the required (non-nullable) /
>> optional (nullable) bit in the type metadata so that it can be kept
>> track of in IPC / schema exchange with other systems as necessary.
>>

Mime
View raw message