arrow-dev mailing list archives

From Daniel Robinson <danrobinson...@gmail.com>
Subject Re: Should Nullable be a nested type?
Date Fri, 26 Feb 2016 03:54:20 GMT
Thanks; specific responses inline below.

I'm mostly persuaded; I never doubted that the C++ code will end up
perfectly optimized. My main interest here is the type definitions at
the metadata level, where I really do think Nullable<T> should be
considered a nested type like List<T> or Struct<T, U>.


On Thu, Feb 25, 2016 at 9:17 PM, Wes McKinney <wes@cloudera.com> wrote:

> Inline responses
>
> On Thu, Feb 25, 2016 at 4:23 PM, Daniel Robinson
> <danrobinson010@gmail.com> wrote:
> > Hi Wes,
> >
> > Thanks for the response!
> >
> > I see the appeal of representing all arrays at the data structure level
> as
> > "nullable," but I think I've convinced myself that it's better to
> represent
> > them all as non-nullable, and bolt on "nullability" as a wrapper type. In
> > addition to the reasons I mentioned yesterday (such as localizing logic
> for
> > dealing with null values in its own class), I think there's a pretty
> strong
> > efficiency case for it.  If you don't distinguish between nullable and
> > non-nullable data structures at the data structure level, you're
> > essentially hiding useful optimization information from the compiler.
> >
>
> I question whether this optimization information is useful given many
> of the target workloads (e.g. scan-based analytics). On one code
> branch (if null_count > 0), a bitmap is examined either 1-bit at a
> time or using popcount (1 word's worth of values at a time -- this
> could improve CPU instruction pipelining), in the other branch the
> bitmap is ignored.
>
>
Makes sense.
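To make sure I follow, here's a minimal sketch of the two-branch pattern you describe (hypothetical struct and function names, not the actual libarrow API): one path ignores the bitmap entirely when null_count == 0, and the other consults it bit by bit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical layout: a values buffer plus a validity bitmap
// (bit i set => slot i is not null), mirroring the Arrow spec.
struct DoubleColumn {
  std::vector<double> values;
  std::vector<uint8_t> bitmap;
  int64_t null_count = 0;
};

inline bool NotNull(const DoubleColumn& col, int64_t i) {
  return (col.bitmap[i / 8] >> (i % 8)) & 1;
}

// One branch consults the bitmap only when null_count > 0;
// the other ignores it entirely.
double Sum(const DoubleColumn& col) {
  double total = 0;
  if (col.null_count == 0) {
    for (int64_t i = 0; i < static_cast<int64_t>(col.values.size()); ++i) {
      total += col.values[i];
    }
  } else {
    for (int64_t i = 0; i < static_cast<int64_t>(col.values.size()); ++i) {
      if (NotNull(col, i)) total += col.values[i];
    }
  }
  return total;
}
```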

> For example, arrays should probably have a simple API for getting a value
> > from a specified slot number, something like get(int32_t offset). This
> > makes it easier to write algorithms that don't have to worry about the
> > underlying memory structure of the array. If all arrays at the data
> > structure level are considered "nullable", this code would have to do an
> > "if (null_count == 0)" check every time get() is called, which is
> > inefficient and somewhat inelegant.
> >
>
> I have been expecting that any user of an Arrow C/C++ library would be
> expected to understand the physical memory layout; with functions to
> access values made available as a convenience. So I would not want you
> to call get(i) unless you know that it is not null and not out of
> bounds.
>

I included a safe_get function that assumes the value is within bounds, as
I discuss below. But I see the case for making the main get() API assume
that the value is neither null nor out of bounds.



>
> > I made a demo showing how I'd envision implementing nullable arrays in
> C++
> > with nested types, and how algorithms on them would work:
> > https://github.com/danrobinson/arrow-demo
> >
> > Right now it's just nullable and primitive types and arrays, but I'll try
> > adding list types and builders to see how it compares to the
> > string-handling code you linked to.
> >
> > One key advantage I see of making Nullable<T> its own nested type is that
> > the algorithm code can make full use of C++'s template specialization.
> For
> > example, here's how I define the versions of the Mean algorithm that work
> > on nullable arrays and those that work on non-nullable arrays (see
> > https://github.com/danrobinson/arrow-demo/blob/master/algorithms.h):
> >
> > // happens when you call Mean() on a non-nullable array
> > template<typename T>
> > double Mean(Array<T>& arr) {
> >   return (double)Sum(arr) / (arr.length());
> > }
> >
> > // happens when you call Mean() on a nullable array
> > template<typename T>
> > double Mean(Array< Nullable<T> >& nullableArr) {
> >   return (double)Sum(nullableArr) / (nullableArr.length() -
> > nullableArr.null_count());
> > }
> >
> > This is only possible if nullability is included in the
> > data-structure-level type information.
> >
> > As you suggested, the code also uses a shortcut when Sum() is called on a
> > "nullable" array that has null_count of 0: it just calls Sum() on its
> child
> > array (see line 24). I think this, too, is simpler if we aren't cramming
> > nullability logic into every array class.
> >
> > There are a few other tricks that could be used whether or not you treat
> > Nullable as a separate type. For example, we can implement an overloaded
> > get() function that takes, in addition to a slot offset, a default value
> > that should be returned if that slot is null (see
> > https://github.com/danrobinson/arrow-demo/blob/master/arrays.h#L114).
> This
> > let me write the main code for the Sum() algorithm only once, since for
> its
> > purposes, 0 is equivalent to null. (It would also presumably work with
> > Product() and more importantly sorting).
> >
>
> Cool, thanks for making a prototype! Working code is worth 1000 words =)
>
> I'm not convinced by the performance argument. For example, this
> snippet has multiple if-then-else branches (which could possibly throw
> exceptions) and so would not be advisable to put in the inner loop of
> algorithms:
>
> value_type get(int32_t slotNumber, value_type null_value) {
>   check_bounds(slotNumber);
>   return safe_get(slotNumber, null_value);
> }
>
> "in time critical code, throwing an exception should be the exception,
> not the rule."
> http://www.boost.org/community/error_handling.html
>
>
As I mentioned, I included a safe_get() function that doesn't do a bounds
check, and that lets you pass a default value to be returned if it is null.
If inner loops use that, there is only one branch, the null check, and no
exceptions. (My Sum() code stupidly didn't use it; I've fixed it now.)  If
they need more complex null-handling logic they could call isNull() on the
nullable array then call get() on the child array.
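Concretely, the single-branch inner loop would look something like this (a simplified sketch with hypothetical names, not the demo's exact code): a get() that takes a caller-supplied default for null slots, so Sum() needs no separate null-handling path.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical nullable array: a validity bitmap wrapping a child
// vector of values, loosely following the prototype's design.
struct NullableDoubleArray {
  std::vector<double> values;
  std::vector<uint8_t> bitmap;  // bit i set => slot i is not null

  bool IsNull(int32_t i) const {
    return !((bitmap[i / 8] >> (i % 8)) & 1);
  }

  // No bounds check, no exceptions: a single branch on the null
  // bit, returning null_value for null slots.
  double get(int32_t i, double null_value) const {
    return IsNull(i) ? null_value : values[i];
  }
};

// Since 0 is the identity for addition, passing 0 as the default
// lets one loop handle null and non-null slots alike.
double Sum(const NullableDoubleArray& arr) {
  double total = 0;
  for (int32_t i = 0; i < static_cast<int32_t>(arr.values.size()); ++i) {
    total += arr.get(i, 0.0);
  }
  return total;
}
```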

But I appreciate your point that the most optimized algorithms will need
to understand the memory layout at a low level. Though I wonder how easy
it will be to write such low-level code for recursive algorithms on
arbitrarily nested data.

> > Again, would appreciate any thoughts.
> >
> > I definitely agree that the interface for users should be the priority
> > (after the memory format). My own motivation for thinking about this is
> my
> > interest in making easy APIs for creating or accessing these memory
> > structures in higher-level dynamically typed languages.
> >
>
> I agree that making Arrow data structures accessible and easy-to-use
> in dynamic languages is desirable. I don't think that adding such
> overhead to reference C++ algorithms (which should probably be as
> close-to-the-metal and performance-oriented as possible, and expect
> users to use them responsibly) is a good idea, since usability /
> accessibility generally comes at the expense of CPU cycles and cache
> efficiency.
>
> For example, it would be straightforward to build "airbags", so to
> speak, in a Python wrapper C extension, since performance is less
> critical in user APIs oriented at accessibility and safety /
> error-handling. In dynamic languages this is especially important as
> bugs in e.g. Python user code should never result in segfaults/core
> dumps.
>
>
I completely agree with this; didn't mean to suggest there should be
runtime overhead in the C++ functions that are used in algorithms.  (The
exceptions in the functions were meant never to be thrown, but I see the
case for leaving them out entirely.)

My main concern was avoiding unnecessary runtime overhead (such as null
checks on a type that shouldn't be nullable). I think there's no downside
to gaining *shortcuts* at *compile time* (and, almost as significantly,
simpler and safer code) by encoding some information about nullability in
the C++ type information. But I appreciate that hand-optimized code would
be able to avoid that overhead anyway.
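For illustration, a stripped-down sketch of what I mean by compile-time shortcuts (hypothetical, much simplified from the demo repo): because Array<T> and Array<Nullable<T>> are distinct C++ types, overload resolution picks the right Mean() at compile time, and the non-nullable version compiles with no null-handling code at all.

```cpp
#include <cstdint>
#include <vector>

// Nullable<T> is just a tag in the type parameter; it carries no data.
template <typename T>
struct Nullable {};

// Primary template: a plain, non-nullable array.
template <typename T>
struct Array {
  std::vector<T> values;
  int32_t length() const { return static_cast<int32_t>(values.size()); }
};

// Partial specialization: a nullable array wraps a non-nullable
// child array and tracks its null count.
template <typename T>
struct Array<Nullable<T>> {
  Array<T> child;
  int32_t null_count_ = 0;
  int32_t length() const { return child.length(); }
  int32_t null_count() const { return null_count_; }
};

// Non-nullable version: no null checks compiled in at all.
template <typename T>
double Mean(const Array<T>& arr) {
  double total = 0;
  for (const T& v : arr.values) total += v;
  return total / arr.length();
}

// Nullable version: divides by the count of non-null slots.
// (Null slots are assumed zeroed in the child's values here.)
template <typename T>
double Mean(const Array<Nullable<T>>& arr) {
  double total = 0;
  for (const T& v : arr.child.values) total += v;
  return total / (arr.length() - arr.null_count());
}
```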



> cheers,
> Wes
>
> > On Wed, Feb 24, 2016 at 6:36 PM, Wes McKinney <wes@cloudera.com> wrote:
> >
> >> hi Dan,
> >>
> >> Thanks for these thoughts! There's a few separate problems
> >>
> >> - Physical memory representation
> >> - Metadata
> >> - Implementation container type layering / user API
> >>
> >> We aren't expecting any system that uses Arrow to necessarily use one
> >> of the reference Arrow implementations, but I expect an ecosystem of
> >> tools will develop around the reference implementations (e.g. Parquet
> >> or Avro adapters).
> >>
> >> In the thread I started last week, I proposed striking this
> >> distinction and instead relegating REQUIRED (non-nullable) and
> >> OPTIONAL (nullable) to the domain of metadata. In my opinion, having
> >> nullability in the container types at all adds much code complexity,
> >> and the two cases of
> >>
> >> - null count == 0
> >> - non-nullable
> >>
> >> must be treated as distinct, but are in general algorithmically equivalent
> >> (e.g. the bitmap need not be allocated, and would be ignored even if
> >> so in most analytics code requiring knowledge of null-ness).
> >>
> >> In trying to come up with functional type systems for representing the
> >> semantics of the data, I find it is most useful to start from the
> >> desired user API (which sufficiently captures the nuances of the
> >> physical memory layout) and work backwards to an implementation.
> >>
> >> I would be interested to see some rough C++ / pseudocode to help
> >> illustrate how this would play out in downstream user code using
> >> libarrow and its headers, and internal algorithms that operate on
> >> arrow Array instances. I've seen many functional programmers work on
> >> these data structure + type system problems and end up chasing their
> >> tails for a very long time. On the C++ side, that code you see in
> >> apache/arrow is a limited prototype that I haven't used to build any
> >> applications yet, so it'd be great to iterate on some working
> >> prototypes of other layerings. For example, we could work on:
> >>
> >>
> >>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/types/string-test.cc#L199
> >>
> >> As pseudocode, suppose we had a function:
> >>
> >> double Sum(DoubleArray* arr) {
> >>   double total = 0;
> >>   double* values = arr->values();
> >>   if (arr->null_count() == 0) {
> >>     for (int i = 0; i < arr->length(); ++i) {
> >>       total += values[i];
> >>     }
> >>   } else {
> >>     for (int i = 0; i < arr->length(); ++i) {
> >>       if (arr->NotNull(i)) {
> >>         total += values[i];
> >>       }
> >>     }
> >>   }
> >>   return total;
> >> }
> >>
> >> If you had instead NullableDoubleArray and DoubleArray, you might want
> >> to generate different functions. This would likely be true for any
> >> algorithm where nulls are semantically meaningful.
> >>
> >> For me at least, the most important aspect of Arrow is the memory
> >> layout. The purpose of the implementations is to provide a reference
> >> user API to make it as convenient and fast as possible to create
> >> Arrow-layout data, manipulate data in-memory, and send/receive data
> >> via IPC/shared-memory.
> >>
> >> - Wes
> >>
> >> On Wed, Feb 24, 2016 at 2:18 PM, Daniel Robinson
> >> <danrobinson010@gmail.com> wrote:
> >> > (The below mostly uses terminology and examples from the draft spec
> >> > <https://github.com/apache/arrow/blob/master/format/Layout.md> and
> C++
> >> > implementation <https://github.com/apache/arrow/tree/master/cpp>).
> >> >
> >> > Under the current spec, if I understand it correctly, there are two
> >> > versions of every type: a nullable version and a non-nullable version.
> >> > Whether a type T is nullable is specified by an attribute on that
> type,
> >> so
> >> > declarations look something like T [nullable]:
> >> >
> >> > Int32 [non-nullable]
> >> >
> >> > Int32 [nullable]
> >> >
> >> >
> >> > In contrast, a variable-length version of type T is declared using a
> >> nested
> >> > or parametric type, List<T>.
> >> >
> >> >
> >> > List [non-nullable] <Int32 [non-nullable]>
> >> > List [nullable] <Int32 [non-nullable]>
> >> > List [non-nullable] <Int32 [nullable]>
> >> > List [nullable] <Int32 [nullable] >
> >> >
> >> > Consistent with this, might it make sense to use a nested type,
> >> > *Nullable<T>*, to declare a nullable version of that type? A slot
in
> an
> >> > array of type Nullable<T> would be able to contain either a null
> value or
> >> > any value of type T. Using this system, the above examples would look
> >> like:
> >> >
> >> > Int32
> >> > Nullable<Int32>
> >> > List<Int32>
> >> >
> >> > Nullable<List<Int32>>
> >> > List<Nullable<Int32>>
> >> > Nullable<List<Nullable<Int32>>>
> >> >
> >> >
> >> > This could also be paralleled by a tweak to implementations (which
> >> wouldn't
> >> > change how the data is stored in physical memory). Right now, in the
> C++
> >> > implementation, any array might be nullable, so the base Array class
> >> > contains the logic necessary to deal with nullable arrays, including
> some
> >> > potentially unnecessary branching.
> >> >
> >> > Instead, much like how an array of type List<T> is instantiated as
a
> >> > ListArray
> >> > <
> >>
> https://github.com/apache/arrow/blob/23c4b08d154f8079806a1f0258d7e4af17bdf5fd/cpp/src/arrow/types/list.h
> >> >
> >> >  that contains its own int32 array of offsets and a reference to a
> child
> >> > "values" array of type T
> >> > <https://github.com/apache/arrow/blob/master/format/Layout.md>, every
> >> array
> >> > of type Nullable<T> could be instantiated as a NullableArray
> containing
> >> its
> >> > own bitmask and a reference to a child array of type T. (It could also
> >> keep
> >> > track of the null_count, if that idea is implemented.)
> >> >
> >> > The child array would not have to know that it is wrapped by a
> >> > NullableArray, and we could abstract almost all of the logic for
> dealing
> >> > with null values out of the Array and ArrayBuilder classes.
> (ArrayBuilder
> >> > would need a method for incrementing an array's length without adding
> a
> >> > particular value, but it already needs that to build a child array in
> a
> >> Sparse
> >> > Union
> >> > <
> >>
> https://github.com/apache/arrow/blob/master/format/diagrams/layout-sparse-union.png
> >> >
> >> > ).
> >> >
> >> > I see a few potential additional advantages to this approach:
> >> >
> >> >    - It would effectively halve the number of distinct types.
> >> >    - As mentioned above, it would be consistent with how the Arrow
> spec
> >> >    treats List types and other nested types. Treating "nullability"
> as a
> >> >    special attribute on each type seems more consistent with how
> >> something
> >> >    like Dremel treats both "optional" and "repeated" fields.
> >> >    - It would also be consistent with how "option types
> >> >    <https://en.wikipedia.org/wiki/Option_type>" are conventionally
> >> declared
> >> >    in other languages. For example, C# has Nullable<T>
> >> >    <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Julia
> has
> >> >    Nullable{T}
> >> >    <
> >>
> http://docs.julialang.org/en/release-0.4/manual/types/#nullable-types-representing-missing-values
> >> >,
> >> > Java
> >> >    has Optional<T>
> >> >    <https://docs.oracle.com/javase/8/docs/api/java/util/Optional.html
> >,
> >> > and Haskell
> >> >    has Maybe a
> >> >    <
> >> https://hackage.haskell.org/package/base-4.8.2.0/docs/Data-Maybe.html>.
> >> >    - The implementation would be similar to a special case of
> >> SparseUnion,
> >> >    which makes sense, since type Nullable<T> is equivalent to T ∪
> >> NullType,
> >> >    where NullType is a type that can only hold the null value.
> >> >    - If the spec ever adds a Masking<T> type that allows a selection
> of a
> >> >    vector using a bitmask, I imagine it, too, would be implemented
> >> similarly.
> >> >    - Parsing a type declaration as a tree would be simpler.
> >> >    - Declarations of non-nullable types would be shorter, and
> >> non-nullable
> >> >    types would be the default. I think this makes sense, since it
> means
> >> types
> >> >    such as Int32 correspond exactly to their C-type, and people would
> be
> >> less
> >> >    likely to be surprised by unexpected null values at runtime.
> >> Additionally,
> >> >    given the overhead (in speed, memory, complexity, and safety) of
> >> nullable
> >> >    types, I think programmers should have to explicitly declare them.
> >> >
> >> > I'd appreciate any thoughts, especially from those with more
> involvement
> >> in
> >> > the project or experience with related projects.
> >> >
> >> > One more thought (which could apply even if nullability is kept as a
> >> > special attribute). To make the string representation of nullable
> types
> >> > more readable, it might be worth considering a convention from C#
> >> > <https://msdn.microsoft.com/en-us/library/1t3y8s4s.aspx>, Swift
> >> > <
> http://lithium3141.com/blog/2014/06/19/learning-swift-optional-types/>,
> >> > Dremel
> >> > <
> >>
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36632.pdf
> >> >,
> >> > and Flow <http://flowtype.org/docs/nullable-types.html>: using either
> >> T? or
> >> > ?T to mean Nullable<T>.
> >> >
> >> > Int32
> >> > ?Int32
> >> > List<Int32>
> >> > ?List<Int32>
> >> > List<?Int32>
> >> > ?List<?Int32>
> >> >
> >> > Int32
> >> > Int32?
> >> > List<Int32>
> >> > List?<Int32>
> >> > List<Int32?>
> >> > List?<Int32?>
> >>
> >> My inclination is still to remove nullable from the C++ array
> >> container types, but to keep track of the required (non-nullable) /
> >> optional (nullable) bit in the type metadata so that it can be kept
> >> track of in IPC / schema exchange with other systems as necessary.
> >>
>
