drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From paul-rogers <...@git.apache.org>
Subject [GitHub] drill issue #914: DRILL-5657: Size-aware vector writer structure
Date Sat, 16 Sep 2017 22:25:51 GMT
Github user paul-rogers commented on the issue:

    The second group of five commits refines the result set loader mechanism.
    ### Model Layer Revision
    Drill is unique for a query engine in that it handles structured types as first-class
data types. For example, Drill supports maps (called “structs” by Hive and Impala), but
also supports arrays of maps. Drill support simple scalars, and also arrays of scalars. Put
these together and we have maps that contain arrays of maps that contain arrays of scalars.
The result is a tree structure described in the Drill documentation as modeled on JSON.
    The prior version used a “model” layer to construct internal structures that models
the tree-structure of Drill’s data types. The model “reified” the structure into a set
of objects. While this worked well, it added more complexity than necessary, especially when
dynamically evolving a schema or working out how to handle scan-time projection.
    This version retains the model layer, but as a series of algorithms that walk the vector
container structure rather than as a separate data structure. The model layer still provides
tools to build readers for “single” and “hyper” vectors, to extract a metadata schema
from a set of vectors, to create writers for a “single” vector container, and so on.
    Replacing the object structure with algorithms required changes to both the row set abstractions
and the result set loader.
    The key loss in this change is the set of “visitors” from the previous revision. The
reified model allowed all visitors to use a common structure. The new solution still has visitors,
but now they are ad-hoc, walking the container tree in different ways depending on whether
the code can work with columns generically, or needs to deal with individual (single) vectors
or hyper vectors.
    ### Revised Result Set Loader Column and Vector State
    To understand the need for many of the changes in this commit, it helps to take a step
back and remember what we’re trying to do. We want to write to vectors, but control the
resulting batch size.
    #### Background
    Writing to vectors is easy if we deal only with flat rows and don’t worry about batch
    * Vectors provide `Mutator` classes that write to single vectors.
    * A set of “legacy” vector writers are available, and are used by some readers.
    * Generated code uses `Mutator` and `Accessor` classes to work with vectors.
    * The “Row Set” classes, used for testing, provides a refined column writer to populate
    The above are the easy parts. Some challenges include:
    * None of the above limit batch memory, they only limit row count.
    * Writing directly to vectors requires that the client deal with the complexity of tracking
a common position across vectors.
    * Drill’s tree structure makes everything more complex as positions must be tracked
across multiple repetition levels (see below).
    The “result set loader” (along with the column writers) provides a next level of completeness
by tackling the vector memory size problem, for the entire set of Drill structured types.
    #### Overflow and Rollover
    The key trick is to handle vector “overflow” seamlessly shifting writes, mid-row,
from a full batch, to a new “look-ahead” batch. The process of shifting data is called
    To implement rollover, we need to work with two sets of vectors:
    * The “active” set: the vectors “under” the writers and returned downstream.
    * The “backup” set: holds the buffers not currently in use.
    During an overflow event, buffers are shifted between the active and backup vectors:
    * On overflow, the full buffers reside in the active vectors.
    * After rollover, the full buffers reside in the backup vectors, and new buffers, now
holding the in-flight row (called the “look-ahead” row), reside in the active vectors.
    * When “harvesting” the full batch, the full buffers and look-ahead vectors are exchange,
so the full buffers are back in the active vectors.
    * When starting the next batch for writing, the look-ahead row is shifted from the backup
vectors into the active vectors and the cycle repeats.
    #### Column and Vector State
    When writing without overflow, we need only one vector and so the usual Drill vector container
is sufficient. With overflow, we have two sets of vectors, and must perform operations on
them, so we need a place to store the data. This is the purpose of the “column state”
and “vector state” classes.
    Think of the overall result set loader structure a has having three key parts:
    * Result set loader: manages the entire batch
    * Column writers: accepts writes to vectors
    * Column state: manages the vectors themselves.
    The result set loader and column writers are part of the public API. Column state is an
internal concept.
    Drill is columnar. Each column in a query is represented by a column state object. This
object tracks the state of the column: normal, overflow, etc.
    Every column state must hold two vectors. The vector state object holds the active and
backup vectors, and performs the required operations on these two vectors.
    Vector state is separate from column state because some columns use multiple vectors.
For example, an array of scalars has an offset vector and a data vector. This column is represented
by a single column state object, but two vector state objects.
    Maps are special: they have a column state for the map as a whole, along with a “tuple
state” to represent the columns within the map. The row itself is a tuple, and so also uses
the tuple state object.
    Repeated maps have not just the tuple state, but also an offset vector that indexes into
the tuple.
    #### Schema Subtlety
    Suppose a application uses the result set loader to write a batch up through overflow.
Suppose that 1000 rows fit into the batch and the 1001st causes overflow. The above description
handles this case.
    Now suppose that on row 1001 the application adds a new column. Should the column appear
within the batch that contains the first 1000 rows? One can argue either way. Consider this
argument. If the application just used row counts, the application would have read 1000 rows
and sent the batch downstream. Then, on the 1001st row (the first row of the next batch) the
application would have added the new column.
    Using this reasoning, we adopt a rule that a new column appears only in the batch that
contains the row in which the column was added. So, in the overflow case, the new column does
not appear in the first batch, but does appear in the second.
    To make this work, the column state tracks schema version numbers and can hold onto columns
that are not yet visible. The result set loader builds an “output” schema that contains
only those columns that are visible. This means that the vector container sent downstream
*must* be distinct from the structure used to hold vectors in the result set loader.
    For this reason, the tuple and column states are the mechanism to hold the entire schema,
while the output vector container is built to include only the visible columns.
    This mechanism is even more necessary in the next commit which will add projection back
into the result set loader mechanism.
    The point is: the tuple and column states allow the result set loader to hold internally
a larger set of columns than are made visible to the downstream operators via the output vector
    ### “Torture” Test
    This commit adds a new result set loader test called the “torture test.” This single
test combines all the complexities that the result set loader must handle:
    * Deeply nested repeated structures
    * Null values
    * Missing values
    * Omitted rows
    * Vector Rollover
    This test has lived up to its name, revealing several subtle bugs in each mechanism. These
bugs lead to increased understanding of the requirements for the underlying mechanisms, which
resulted in many subtle changes to the accessor and result set loader layers. Little new functionality
was added, but the existing functionality was refined to handle many odd cases.
    ### Repetition Levels
    The tests mostly revealed issues with managing offsets in the layers of repeated types.
Drill has two kinds of nesting in its tree structure: structures and arrays. Structures just
add a level to the name space. (For example, a top-level map introduces a name space, “m”,
say, with its columns nested within that name space: “m.a”, “m.b”.) The other kind
of nesting introduces a “repetition level.” Drill has four kinds of repetition levels:
    * Rows within a row set (batch)
    * Scalars within an array of scalars
    * Maps (structs) within an array of structs (repeated map)
    * Lists (not yet supported in this code)
    Getting all the offset vectors, pointers, and rollover logic to work was non-trivial.
A persistent source of errors is the fact that offsets written for a row are one head of the
row itself. That is, row 10 writes its end offset in position 11, and so on. This requires
many, many special cases.
    ### Rollover Refinement
    When a vector overflows, data must be “rolled over” from the in-flight batch to a
new one. Rollover is simple with “flat” rows. But, again, many subtle problems crop up
when dealing with repetition levels and their corresponding offset vectors and nested indexes.
The code here refines rollover handling with several new “writer events” to prepare for,
and clean-up after rollover.
    Rollover requires work at both the vector and writer level. Work is divided as follows:
    * A “preRollover()” event in the writers prepares vectors for overflow by filling
empties and setting value counts.
    * The result set loader “column state” and “vector state” classes move data from
one set of vectors to another.
    * The “postRollover()” even in the writers resets writer indexes, addresses and other
pointers to reflect the new data positions.
    ### Even More Javadoc
    As the code has stabilized, it has become worthwhile describing the structure in detail
in Javadoc, complete with “ASCII-art” to illustrate some of the concepts when working
with Drill’s tree-structured vector model.
    ### More Tests
    Additional lower-level “row set” tests were added for various bits of the code. Also,
a couple of tools for printing structures were added to aid debugging. (It is easier to fix
thing when we can visualize the complex data structures and vector contents.)
    ### Other Changes
    A few nested classes have grown larger and were pulled out into their own files.


View raw message