drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5958) Revisit the List and RepeatedList vectors
Date Sun, 12 Nov 2017 23:08:01 GMT
Paul Rogers created DRILL-5958:

             Summary: Revisit the List and RepeatedList vectors
                 Key: DRILL-5958
                 URL: https://issues.apache.org/jira/browse/DRILL-5958
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.11.0
            Reporter: Paul Rogers

Drill provides a List vector used when reading JSON data. The semantics of this vector are
somewhat obscure and overly complex. This ticket asks to clean up the design and implementation
of this vector.

h4. Current Behavior

Drill contains two kinds of repeated types:

* Repeated vectors, which exist for all Drill types.
* List vectors, which exist outside the repeated vector system.

Lists are rather hard to explain.

Drill has 38 types. Each type comes in three cardinalities: Required (0), Optional (0, 1)
or Repeated (0..n). Thus, there is an {{IntVector}}, a {{NullableIntVector}} and a {{RepeatedIntVector}}.

Lists are an an odd duck and exist outside of this system. A list is not simply another level
of repetition (a {{RepeatedRepeatedIntVector}}. Rather, a list is heterogeneous: it is just
a list of something.

For this reason, the List type is closely associated with the Union type: a list is, essentially,
a "repeated Union", though it is not implemented that way.

Strangely, Drill does have a {{RepeatedListVector}}, which introduces all manner of ambiguity.
Combining these, the cardinality hierarchy for unions is:

* {{UnionVector}} (like an optional union type)
* {{ListVector}} (repeated union)
* {{RepeatedListVector}} (a 2D union array)
* {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note that this
could also be implemented as a {{ListVector}} that contains a {{RepeatedListVector}}.)
* {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper grid.)
* And so on.

For a primitive type, such as Int, we have:

* {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
* {{RepeatedIntVector}} (a 1D list of Int)
* {{ListVector}} which contains a {{RepeatedIntVector}} (a 2D array of ints. Not that this
could have been a {{RepeatedListVector}} that stores only ints.)
* {{RepeatedListVector}} which contains a {{RepeatedIntVector}} (a 3D cube of ints. This could
also be formed by a {{ListVector}} that contains a {{ListVector}} that contains a {{RepeatedIntVector}}
along with several other combinations.)

h4. Examples of Current Behavior

Lists and repeated types appeared to evolve to support JSON-like structures. For example:

{a: 10} {a: null}

Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.

{a: [10, 20]}

Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}. 

Drill does not allow nulls in such vectors, so we cannot represent:

{a: [10, null, 20]}

Once we go beyond 1D, we need lists:

{a: [[10, 20], [30, 40]]}

The above requires a {{ListVector}} that contains a {{RepeatedIntVector}}.

{a: [[[110, 120], [130, 140]], [210, 220], [230, 240]]}

The above requires a {{RepeatedListVector}} that contains a {{RepeatedIntVector}}.

Similarly, since lists can hold any type (just like a union), we can have repeated objects:

{a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}

The above would be represented as a {{ListVector}} that contains a {{RepeatedMapVector}}.
(Or, equivalently, a {{RepeatedListVector}} that contains a {{MapVector}}.)

Because the List vector is a union type, it can (presumably) also handle heterogeneous lists
(though this needs to be checked to see if the code actually supports this case):

{a: [10, "fred", 123.45, null]}

Since unions support combinations of not just scalars, but also scalars and complex types,
Drill can also support:

{a: [10, {b: "foo"}, null, [10, "bob"]]}

h4. Muddy Semantics

The above show a number of problems that make lists (and unions) far more complex than necessary:

* Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a {{RepeatedFooVector}}.
* Ambiguity of when to use a {{ListVector}} of {{RepeatedFooVector}} vs. a {{RepeatedListVector}}
of {{FooVector}}.

The same solution used to handle extra layers of repetition is used to handle variant types

* Lists can handle any combination of scalars.
* Lists can handle any structure type (map, repeated map, list, repeated list).
* Lists are thus not typed. They are not a "List of Int", they are just a List.

h4. Mapping to SQL

The above muddy semantics give rise to this question. Drill is a SQL engine, how do we map
the List semantics to a relational schema? If we don't have a clean answer, then the List
type, while clever, does not have a useful purpose and is instead distracting us from the
larger question of how we map JSON-like structures to a relational schema.

This is not a new problem; it has been with the community at least since the days of XML.
There simply XML (and JSON) structures that do not map to a relational structure, though all
relational structures can be mapped to JSON.

Drill's job, then, is to encode any JSON structure which has a relational interpretation.
IMHO, it is not Drill's job to encode non-relational structures. (In the same way, in years
past, it was not the job of BI tools to encode an XML-encoded HTML document to produce charts
or reports.)

h4. Allowable JSON Structure

Let us take a stab at the allowable JSON structures:

* Objects (at any level) which map to either the top-level row, or nested tuples (called "maps"
in Drill.)
* Arrays of scalars which map to scalar arrays in Drill (which must be flattened into the
global name space, using dotted names, to map to a JDBC or ODBC client.)
* Arrays of objects, which map to tuple arrays in Drill (which must be flattened or explicitly
joined to map to a JDBC or ODBC client.)
* Fields with heterogeneous types, which map to union/variant types in Drill (and which must
be reconciled to present as SQL to an ODBC/JDBD client.)
* Array with heterogeneous types, which is one convention for efficiently storing tuples.
These map to an array of union/variant types, and must be mapped in Drill to a true tuple,
then flattened, to present to a JDBC or ODBC client.

This definition starts with the relational model, then determines the many ways the relational
model can be encoded in JSON. This analysis suggests that we need not handle extreme cases:

* Multi-dimensional arrays (which can't be mapped to the relational model easily, though they
could be mapped to an MDX-style multi-dimensional model.)
* Heterogeneous collections of simple and structured types except in the tuple-as-array case.

For the heterogeneous cases, Drill would need some up-front information to know that, say,
an array represents a tuple and to expect the heterogeneous encoding. In this case, each array
slot would have a single type; it would not make sense to have a tuple-as-array in which each
slot is a variant. (There is no sane way to map such a structure back to a fixed-type relational

h4. Proposal: Generic Array Vector

Given the above, we can see that the current List and Union types are far too general for
use in a relational system. We can suggest simplifications.

* The union/variant type handles mismatched data types. (But, as explained in DRILL-5955,
a better solution is to dispense with the union/vector type and instead allow the user to
specify a type to which the input values are mapped at read time.)
* The map and repeated map vectors already handle nested tuples and nested tables (arrays
of tuples).

Given this, there is really no need for the union or list types. But, suppose we wanted to
keep them anyway (because, say, we are not convinced by the above arguments, Drill already
has them, and the community does not want to take the time to consider whether they should
be removed.)

One solution is to rationalize the type system:

* Define a base (required) vector for each primitive type and for the map (tuple) type.
* Define a single nullable vector type which holds one of the base vectors.
* Define a repeated vector type that holds one of the base vectors.
* If necessary, define a variant which can hold any primitive value and itself is considered
a nullable vector.

In this model, there is exactly one mapping from any array structure to a vector structure.

|| JSON || Vector Structure ||
| &#123;a: 10&#125; | IntVector |
| &#123;a: 10&#125;, &#123;a: null&#125; | NullableVector(IntVector) |
| &#123;a: 10&#125;, &#123;a: "foo"&#125; | VariantVector |
| &#123;a: \[10, 20]&#125; | ArrayVector(IntVector) |
| &#123;a: \[\[10, 20], \[30, 40]]&#125; | ArrayVector(ArrayVector(IntVector)) |
| &#123;a: &#123;b: 10&#125;&#125; | TupleVector |
| &#123;a: \[&#123;b: 10&#125;, &#123;b: 20&#125;]&#125; | ArrayVector(TupleVector)
| &#123;a: \[10, "foo"]&#125; | ArrayVector(VariantVector) |
| &#123;a: \[10, "foo"]&#125; | TupleVector (if hints provided) |

h4. Arrow

[Arrow|https://arrow.apache.org/docs/memory_layout.html] has adopted a solution similar to
the proposed {{ArrayVector}}. In Arrow, a {{List}} vector is a list of some type: {{List<Int>}},

h4. Backward Compatibility

Because Drill exposes internal vector representation through the network APIs, the above is
a breaking change for the network API. A solution such as DRILL-5947 is required before we
can implement a solution such as the above.

This message was sent by Atlassian JIRA

View raw message