drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5955) Revisit Union Vectors
Date Sun, 12 Nov 2017 20:30:00 GMT
Paul Rogers created DRILL-5955:

             Summary: Revisit Union Vectors
                 Key: DRILL-5955
                 URL: https://issues.apache.org/jira/browse/DRILL-5955
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.11.0
            Reporter: Paul Rogers

Drill supports a “Union Vector” type that allows a single column to hold values of multiple
types. Conceptually, each column value is a (type, value) pair. For example, row 0 might be
an Int, row 1 a Varchar and row 2 a NULL value.

The name refers to a C “union” in which the same bit of memory is used to represent one
of a set of defined types.

Drill implements the union vector a bit like a map: as a collection of typed vectors. Each
value is keyed by type. The result is that a union vector is more like a C “struct” than
a C “union”: every vector takes space, but only one of the vectors is used for each row.
For the example above, the union vector contains an Int vector, a Varchar vector and a type
vector. For each row, either the Int or the Varchar is used. For NULL values, neither vector
is used.

h4. Memory Footprint Concerns

The current representation, despite its name, makes very inefficient use of memory because
it requires the sum of the storage for each included type. (That is, if we store 1000 rows,
we need 1000 slots for integers, another 1000 for Varchar and yet another 1000 for the type

Drill poorly supports the union type. One operator that does support it is the sort. If the
union type is enabled, and the sort sees a schema change, the sort will create a new union
vector that combines the two types. The result is a sudden, unplanned increase in memory usage.
Since the sort can buffer many hundreds of batches, this unplanned memory increase can cause
the sort to run out of memory.

h4. Muddy Semantics

The union vector is closely tied with the List vector: a list vector is, essentially, an array
of unions. (See DRILL-xxxx). The list type is used to model JSON in which a list can hold
anything: another list, an object or scalars. For this reason, the union vector also can hold
any type. And, indeed, it can hold a union of any of these types: a Map and an Int, or a List
and a Map.

Drill is a relational, SQL-based tool. Work is required to bring non-relational structures
into Drill. As discussed below, a union of scalars can be made to work. But, a union of structured
types (lists, arrays or Maps) makes no sense.

h4. High Complexity

The union vector, as implemented is quite complex. It contains member variables for every
other vector type (except, strangely, the decimal types.) Access to typed members is by type-specific
methods, meaning that the client code must include a separate call for every type, resulting
in very complex client code.

The complexity allowed the union type to be made to work, but causes this one type to consume
a disproportionate amount of the vector and client code.

h4. Proposed Revision to Structure: The Variant Type

Given the above, we can now present the proposed changes. First let us recognize that a union
vector need not hold structured types; there are other solutions as discussed in DRILL-xxxx.
This leaves the union vector to hold just scalars.

h4. Proposed Revision to Storage

This in turn lets us adopt the [Variant type|https://en.wikipedia.org/wiki/Variant_type] originally
introduced in Visual Basic. Variant “is a tagged union that can be used to represent any
other data type”. The Variant type was designed to be compact by building on the idea of
a tagged union in C.

struct {
  int tag; // type
  union {
    int intValue;
    long longValue;

When implemented as a vector, the format could consume just a single variable-width vector
with each entry of the form: {{\[type value]}}. The vector is simply a sequence of these (type,
value) pairs.

The type is a single-byte that encodes the MinorType that describes the value. That is, the
type byte is like the existing type vector, but stored in the same location as the data. The
data is simply the serialized format of data. (Four bytes for an Int, 8 bytes for a Float8
and so on.)

Variable-width types require an extra field: the type field: {{\[type length value]}}. For
example, a Varchar would be encoded as {{\[Varchar 27 byte0-26]}}.

A writer uses the type to drive the serialization. A reader uses the type to drive deserialization.

Note that the type field must include a special marker for nulls. Today, the union type uses
0 to indicate a null value. (Note that, in a union and variant, a null value is not a null
of some type, both the type and value are null.) That form should be used in the variant representation
as well. But, note that the 0 value in the MajorType enum is not Null but is instead Late.
This is an unpleasant messiness that the union (and variant )encoding must handle.

An offset vector provides the location of each field, as is done with variable-length vectors

The result is huge compaction of space requirements from multiple vectors per type to just
two vectors (offsets and data.)

Such a change would be daunting if clients work directly with vectors. However, with the introduction
of the “result set loader” and “reader” abstractions, this change in format would
be completely hidden from client code. The “result set” abstractions provide high level
APIs that isolate clients from implementation, allowing changes such as this.

h4. Arrow Union Types

[Arrow|https://arrow.apache.org/docs/memory_layout.html] (see “Dense union type”) has
retained Drill’s union vector design: it contains:

* One child array for each relative type
* Types buffer…
* Offsets buffer…

Unlike Drill, Arrow also has a “Sparse union type” that omits the offsets buffer if the
child types are all of the same length.

The variant type is an opportunity for Drill to lead based on our extensive experience with
vectors in production systems. Once the variant type is proven in production, we can offer
it to Arrow as part of the Drill/Arrow integration.

h4. Backward Compatibility

While the actual code change is quite straightforward, the far larger challenge is backward
compatibility. Drill offers both JDBC and ODBC drivers. These drivers make use of Drills internal
vector storage format. Thus, any change to the vector format will appear on the wire and must
be understood by these clients.

Drill does not, unfortunately, provide a versioned API to deal with these issues.

One solution is to require uses to upgrade their drivers at the same moment that they upgrade
their servers. This is often impractical, however, and is not possible if a single driver
must work with multiple Drill servers upgraded at different times.

To allow this upgrade, Drill needs a version negotiation protocol between client and server:

* The client connects and identifies the range of API versions that it supports, with the
newest version being the version of the client itself.
* The server receives the message and computes the version of the session as the newest client
version the the server supports.
* The server returns this version to the client which switches to the selected API version.
* The server and client use only messages valid for the given API version. This may mean converting
data from one representation to another.

The above is pretty standard.

For the union vector, let’s say the variant alternative is introduced in version Y. If a
version X (older) client connects, the server converts the variant type to union format before
sending to the client.

Thus, before we can change the union vector (or, for that matter, any vector), we must release
clients that understand the version handshake protocol. Then, once those clients are deployed,
a following server version can make the vector changes.

Note that this same issue will arise (only in much more complex form) if Drill were to adopt

h4. SQL-level Variant Semantics

The union vector (and the proposed new “variant” vector) exist to hold a variety of types.
However, SQL is designed to work with just a single type. Therefore, we must consider not
just storage representation, but also query semantics.

A challenge is that neither JDBC nor ODBC were designed for variants, nor do most analytic
tools know how to interpret varying data types. Indeed, since these APIs and tools are designed
for relational data (in which the type of each column is known and fixed), then it is the
job of the query tool to determine the column type.

This means that, when using JDBC and ODBC, all union/variant processing must be done within
Drill itself, with the client seining a single, combined output type after some internal operation
to produce that combined type.

One simple use case is to handle type schema changes within an input. For example, in JSON,
a value might first present as an Integer, later as a Float. Or, a value might start small
enough for a Float, but later present values that require a BigDecimal.

In such cases, a variant type allows Drill to hold values that correspond to how the JSON
parser retrieved the values.

To use those values in SQL, however, the user must unify them, perhaps with a Cast. For example,
in the mixed-number case above, the user might cast the column to a decimal.

h4. Alternatives to the Union/Variant Types

Here, however, we can take a step back and ask a larger question. If the union/variant vector
is to handle schema changes, might it be better to simply push the final schema down to the
reader, and simply interpret the data as the final value at read time? That is, if we could
tell the JSON reader (say) that column “x” is a Decimal, then the reader can do the conversion,
saving all the complexity of a union (or variant) vector and casting.

One way to do this is to “push” cast operations into the reader by providing the reader
not just column names, but with the types as well. That is, projected columns are not just
names, they are (name, type) pairs.

The above cannot solve the {{SELECT *}} case, however, as the user has chosen not to specify
names (let alone types.)

A more general solution is to allow the user to specify the column types as metadata (as is
already done in all other query tools, perhaps via Hive.) Then, the user need not specify
the types via casts in each query. Because the types are known at read time, {{SELECT *}}
works fine. As a result, the need for a union/variant never arises.

Here it is worth pointing out that Drill must still be able to query data without a schema.
But, type conflicts may appear since Drill can’t predict the future. The user than makes
a decision that the easiest path forward for their own use case is to 1) live with the issue,
2) add casts to each query, 3) add casts to a per-file view, or 4) provide metadata that solves
the problem once and for all.

Given this there other cases where we actually do need the union type? Do we have compelling
use cases? If not, then the best path forward to fix the union type is simply to retire it
in favor of the type hints described above.

This message was sent by Atlassian JIRA

View raw message