drill-dev mailing list archives

From paul-rogers <...@git.apache.org>
Subject [GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure
Date Fri, 18 Aug 2017 05:53:22 GMT
GitHub user paul-rogers opened a pull request:

    https://github.com/apache/drill/pull/914

    DRILL-5657: Size-aware vector writer structure

    This large PR provides another two levels of foundation for size-aware vector writers in the Drill record readers. It combines code from two previous PRs:
    
    * PR 866 - DRILL-5657: Implement size-aware result set loader
    * PR 887 - DRILL-5688: Add repeated map support to column accessors
    
    The PR then goes on to integrate the two prior PRs and provide additional functionality.
    
    Like the two previous PRs, this one is divided into commits to group the work.
    
    1. Accessor layer
    2. Row Set layer
    3. Tuple and Column Model layer
    4. Row Set Loader layer
    5. Secondary changes
    
    Much of the material below appears in Javadoc throughout the code. The material here is
not meant to replace that documentation. Instead, it is meant to provide the “big picture”:
placing the bits and pieces in context and pointing out interesting functionality to explore
in each layer.
    
    ## Commit 1: The Accessor Layer
    
    The first commit provides the core of the mechanism: the writers that put data into vectors,
and the readers that retrieve that data. The version here is an evolution of the version provided
in an earlier PR a few months ago.
    
    ### Overview of the Drill Vector Data Model
    
    The code will make much more sense if we start with a review of Drill’s complex vector data model. Drill has 38+ data (“minor”) types as defined in the [proto buf](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/Types.proto) definition. Drill also has three cardinalities (“modes”) defined in the same file. The result is more than 120 different vector types. Then, when you add maps, repeated maps, lists and repeated lists, you rapidly get an explosion of types that the writer code must handle.
    
    Vectors can be categorized along multiple dimensions:
    
    * By data (minor) type
    * By cardinality (mode)
    * By fixed or variable width
    
    A repeated map, a list, a repeated list and any repeated (array) scalar are all array-like. Nullable and required modes are identical (single values), but a nullable type has an additional is-set (“bit”) vector.
    
    A key contribution of this PR is the data model used to organize vectors.
    
    * Both the top-level row, and a Drill map are “tuples” and are treated similarly in
the model.
    * All non-map, non-list (that is, scalar) data types are treated uniformly.
    * All arrays (whether a list, a repeated list, a repeated map, or a repeated scalar) are
treated uniformly.
    
    ### Accessor Data Model
    
    The above leads to a very simple, JSON-like data model, introduced in this PR; a short sketch follows the list below.
    
    * A tuple reader or writer models a row. (Usually via a subclass.) Columns are accessible by name or position.
    * Every column is modeled as an object.
    * The object can have an object type: scalar, tuple or array.
    * An array has a single element type (but many run-time elements).
    * A scalar can be nullable or not, and provides a uniform get/set interface.
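    
    A minimal sketch of this model follows. It assumes accessor interfaces in `org.apache.drill.exec.vector.accessor` named `TupleWriter`, `ScalarWriter` and `ArrayWriter`; the `tuple()` and `array()` accessors shown here are illustrative, not necessarily the exact method names.
    
    ```
    // Hedged sketch only; names other than scalar()/setInt()/setString() are assumptions.
    void writeOneRow(TupleWriter row) {
      row.scalar(0).setInt(10);                    // scalar by position
      row.scalar("name").setString("fred");        // scalar by name
    
      TupleWriter address = row.tuple("address");  // a map column is itself a tuple
      address.scalar("city").setString("Boston");
    
      ArrayWriter phones = row.array("phones");    // an array has one element type...
      phones.scalar().setString("555-1234");       // ...but many run-time elements
      phones.scalar().setString("555-5678");
    }
    ```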
    
    This data model is similar to, but has important differences from, the prior generated readers and writers.
    
    The object layer is new: it is the simplest way to model the three “object types.”
An app using this code would use just the leaf scalar readers and writers.
    
    Although there is quite a bit of code change here to provide the new structure, the core functionality of reading and writing to vectors has not changed much. And, this code has extensive unit tests, which should avoid the need to "mentally execute" each line of code.
    
    See the classes in `org.apache.drill.exec.vector.accessor` for details. In particular,
please see the `package-info.java` file in that package for more information.
    
    As before, the top package provides a set of interfaces; the inner packages provide the
implementation. The `ColumnAccessors.java` template generates the per-vector code. Warning:
this template has become quite cryptic: the best bet for review is to generate the Java code
and review that.
    
    ### Writer Performance
    
    During previous review, we discussed ways to optimize writer performance. This PR has
two improvements:
    
    * Completely rework the writers to minimize code steps
    * Rework the “column loaders” to eliminate them: instead of two additional method
calls, the “loader” now uses the column writers directly.
    
    Much behind-the-scenes rework was needed to accomplish the above.
    
    Column readers, however, were left with their existing structure; we can apply the same
optimizations to the readers in a later PR.
    
    While doing the performance optimization, it became clear we can adopt a major simplification.
Writers evolved from having three versions (nullable, required, repeated) to having a single
version to handle all three cases. Some items of note:
    
    * The writers bypass DrillBuf and the UDLE to write directly to direct memory.
    * The writers cache the buffer address and implement a number of methods to synchronize that address when the buffer changes (on a new batch or during vector resize).
    * Writing requires a single bounds check. In most cases, the write is within bounds, so the single check is all that is needed. (A sketch appears after this list.)
    * If the write is out of bounds, then the writer determines the new vector size and performs
the needed reallocation. To avoid multiple doublings, the writer computes the needed new size
and allocates that size directly.
    * Vector reallocation is improved to eliminate zeroing the new half of the buffer; data is left “garbage-filled.”
    * If the vector would grow beyond 16 MB, then overflow is triggered, via a listener, which
causes the buffer to be replaced. The write then continues.
    * Offset vector updates are integrated into the writers using an `OffsetVectorWriter`.
This writer caches the last write position so that each array write needs a single offset
update, rather than the read and write as in previous code.
    * The writers keep track of the “last write position” and perform “fill-empties” work if the new write position is more than one position behind the last write. All types now correctly support “fill-empties” (before, only nullable types did so reliably).
    * Null handling is done by an additional writer layer that wraps the underlying data writer.
This avoids the need for a special nullable writer: the same nullable layer works for all
data types.
    * Array handling is done similarly: an array writer manages the offset vector and works
the same for repeated scalars, repeated maps and (eventually) lists and repeated lists.
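    
    The sketch below illustrates the single-bounds-check write path and the overflow listener described above. It is illustrative only: the field and method names are assumptions, not the code in this PR, and the real writers handle rebinding the buffer address after overflow.
    
    ```
    // Hypothetical fixed-width int writer, for illustration only.
    class IllustrativeIntWriter {
      static final int VALUE_WIDTH = 4;
      static final int MAX_BUFFER_SIZE = 16 * 1024 * 1024;   // the 16 MB vector limit
    
      long bufferAddr;          // cached direct-memory address of the backing buffer
      int capacity;             // current buffer size in bytes
      OverflowListener listener;
    
      interface OverflowListener { void overflowed(); }
    
      void setInt(int rowIndex, int value) {
        int offset = rowIndex * VALUE_WIDTH;
        if (offset + VALUE_WIDTH > capacity) {                // the single bounds check
          resizeOrOverflow(offset + VALUE_WIDTH);             // grow once, or roll over
        }
        // Direct write to memory, bypassing DrillBuf/UDLE.
        io.netty.util.internal.PlatformDependent.putInt(bufferAddr + offset, value);
      }
    
      private void resizeOrOverflow(int minCapacity) {
        if (minCapacity > MAX_BUFFER_SIZE) {
          listener.overflowed();  // loader swaps in a new vector; addr/capacity are rebound
        } else {
          // Compute the target size directly (no repeated doubling), no zero-fill.
          // ... reallocate, then update bufferAddr and capacity ...
        }
      }
    }
    ```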
    
    Since vector overflow is now incorporated directly into the writers, this PR backs out
the `setScalar()` and related methods added to value vectors in a previous commit.
    
    ### Supporting Classes
    
    A new `TupleMetadata` class is a superset of the existing `BatchSchema`, but also provides
"classic" tuple-like access by position or name. Eventually, this will also hold additional
information such as actual size and so on (information now laboriously rediscovered by the
"record batch sizer.") Since the accessors use a "tuple" abstraction to model both rows and
maps, the tuple metadata provides this same view. The top-most tuple models the row. Columns
within the row can be maps, which have their own tuple schema, and so on.
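    
    A hedged sketch of the tuple-like access, assuming `TupleMetadata` offers lookups named `metadata()`, `index()` and `size()`, and that a map column exposes its own schema via a `mapSchema()`-style accessor; treat these names as illustrative.
    
    ```
    // Hedged sketch, not the definitive API.
    void inspectSchema(TupleMetadata schema) {
      int columnCount = schema.size();                    // tuple-like, positional access
      ColumnMetadata first = schema.metadata(0);
      ColumnMetadata byName = schema.metadata("name");    // access by name
      int position = schema.index("name");
    
      // A map column exposes its own tuple schema, mirroring the row/map symmetry.
      ColumnMetadata address = schema.metadata("address");
      TupleMetadata addressSchema = address.mapSchema();  // hypothetical accessor
    }
    ```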
    
    `TupleNameSpace` moves locations (so it can be used in the vector package) but otherwise
remains unchanged.
    
    `DrillBuf` provides an experimental `putInt()` method that does bounds checking and sets
a value, to minimize calls. This will probably move into the writer in a later PR.
    
    This PR fixes DRILL-5690, a bug in repeated vectors that did not pass along Decimal scale
and precision. See `RepeatedValueVectors.java`.
    
    `MaterializedField` changes to add an `isEquivalent()` method to compare two fields, ignoring
internal (`$offset$`, `$bits$`, etc.) vectors.
    
    ## Commit 2: Row Set Layer
    
    The “Row Set” family of classes allows tests to quickly build and analyze vector containers in the form of a “row set” - a set of rows. In prior commits, the row set abstraction was the primary client of the accessors (readers and writers). In this commit, much functionality shifts to be shared with the new “loader” abstraction intended for production code.
    
    Key changes in response to the accessor changes:
    
    * The reader and writer are moved to separate files.
    * Row sets now use a common “model” to describe a vector tree (more below).
    * Static factory methods were added to hide constructor complexity.
    * The `RowSetBuilder` and `RowSetComparison` test tools added support for repeated maps.
    * Code to handle generic object writing moved from the `RowSetBuilder` into the accessors.
    * The old `RowSetSchema` evolved to become the `TupleMetadata` mentioned above.
    * Tests were greatly enhanced to test all modes of all supported scalar types, as well
as the new JSON-like structure.
    
    As before, the row set unit tests exercise the row set classes themselves, but are also the mechanism to exercise the accessor classes.
    
    See the `org.apache.drill.test.rowSet` and its sub-packages for details.
    
    ## Commit 3: Tuple and Column Models
    
    In the previous version, the row set classes had complex logic to figure out what kind of accessor to create for each vector; this became overly complex. In the previous PR, the row set "parses" a vector container to create "storage" objects that represent tuples and columns. A column can, itself, be a tuple. (Note: there is no need to model lists, since lists are just vectors at this level of abstraction and so need no special handling.)
    
    In this PR, that concept evolves again to become a “tuple model” and “column model”: a tree-structured representation of the structure of a Drill row. The model cleanly represents things like maps, repeated maps, lists of repeated maps that contain arrays, and so on.
    
    Think of the model as the “backbone” to work with a batch vector tree. (“Tree”
because maps and lists contain other vectors.) Ideally, the vectors themselves would provide
this backbone. But, vectors are shared across operators, so that structure created by one
operator would conflict with structure needed by another.
    
    With this change, accessor creation is a simple matter of walking a tree to assemble the
JSON-structure.
    
    This structure is also used to create a batch's vectors from a schema using “visitor”
classes, to allocate vectors based on metadata and other chores needed in the result set loader.
    
    See `org.apache.drill.exec.physical.rowSet.model`. The top package contains interfaces. The two sub-packages provide implementations for single batches and hyper batches. (Hyper batches are used only in the test “RowSet” abstractions at present.)
    
    ### Tuple and Column Metadata
    
    Also in this commit is a new “extended” metadata model that replaces earlier versions
in the row set and previous “loader” commits.
    
    For reasons that will become clear in the next PR, the scan batch ends up doing quite
a bit of semantic analysis to map from the select list and the table schema to the result
schema. Drill provides a BatchSchema class that is useful, but limited in this context. The
metadata layer helps to solve this problem by allowing fast access by both name and position,
and allows the schema to grow dynamically.
    
    The dynamic aspect appears in this PR as well: the “loader” API (see below) allows
adding columns as the app proceeds; and that action internally adds new columns to the batch
schema.
    
    The metadata classes evolved from an earlier version in the row set abstractions. That
code was extracted and extended to act as the foundation to create the metadata layer.
    
    The metadata model is simple: it represents tuples (rows, maps) and columns. The metadata is intended to provide additional information beyond what exists in the `MaterializedField` class. In particular, it provides hints about VarChar width and array size. It also provides many useful methods to simplify working with columns.
    
    Eventually, it would be good to combine the new metadata classes with the existing `BatchSchema`
and `MaterializedSchema`. But, that is a big change that must wait for another day.
    
    ## Commit 4: The Result Set Loader Layer
    
    We now finally get to the ultimate goal of this PR: to introduce a new “result set loader.”
This abstraction is an evolution of the “Mutator” class in the scan batch. (The mutator
manages a batch; some readers used the existing “writers” with the mutator, others used
different techniques such as working directly with native memory.)
    
    A result set loader loads a set of tuples (AKA records, rows) from any source (such as
a record reader) into a set of record batches. The loader:
    
    * Divides records into batches based on a maximum row count or a maximum vector size,
whichever occurs first. (Later revisions may limit overall batch size.)
    * Tracks the start and end of each batch.
    * Tracks the start and end of each row.
    * Provides column loaders to write each column value.
    * Handles overflow when a vector becomes full, but the client still must finish writing
the current row.
    
    The original Mutator class divided up responsibilities:
    
    * The Mutator handled the entire record batch
    * An optional VectorContainerWriter writes each record
    
    The result set loader follows this same general pattern.
    
    * The result set loader handles the entire record batch (really, a series of batches that
make up the entire result set: hence the name.)
    * The TupleLoader class provides per-tuple services, which consist mostly of access to the column loaders.
    * A tuple schema defines the schema for the result set (see below.)
    
    To hide this complexity from the client, a ResultSetLoader interface defines the public
API. Then, a `ResultSetLoaderImpl` class implements the interface with all the gory details.
Separate classes handle each column, the result set schema, and so on.
    
    Recall that the overall goal of this PR is to handle memory fragmentation by limiting
vectors to 16 MB in size. The Loader realizes that goal by silently handling vector overflow
by moving the “overflow” row to a new “look-ahead” batch. The complete batch is then
sent downstream, and the next read starts with the overflow row as the first row of the next
batch.
    
    This version is a bit simpler than the previous one because this one leverages the model,
metadata and accessor layers for much of the work.
    
    ### Loader API
    
    The actual loader API turns out to be quite simple, which serves another important goal: allowing third parties to more easily create custom “record readers.” For example:
    
    ```
     void writeABatch() {
       RowSetLoader writer = ...
       while (writer.start()) {
         writer.scalar(0).setInt(10);
         writer.scalar(1).setString("foo");
         ...
         writer.save();
       }
     }
    ```
    
    Note that, in this version, the `writer` above is defined in the result set loader, but
the writer is simply a tuple writer defined in the accessor layer. By the time we get to the
`setInt()` or `setString()` methods, the code does a single method call to get into position
to write the data. This eliminates the extra “column loader” layer of the prior version
of the result set loader.
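    
    For context, a hedged sketch of the surrounding batch lifecycle follows. It assumes `ResultSetLoader` methods named `startBatch()`, `writer()`, `harvest()` and `close()`; the `MyReader` type and its methods are hypothetical stand-ins for a record reader.
    
    ```
    // A sketch under stated assumptions, not the definitive API.
    void readAll(ResultSetLoader loader, MyReader reader) {
      while (reader.hasMore()) {
        loader.startBatch();
        RowSetLoader writer = loader.writer();
        while (reader.hasMore() && writer.start()) {  // start() is false once the batch fills
          reader.writeRow(writer);                    // sets columns via the scalar() writers
          writer.save();
        }
        VectorContainer batch = loader.harvest();     // completed batch, sent downstream
        // ... hand off 'batch' to the downstream operator ...
      }
      loader.close();
    }
    ```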
    
    ### Writer and Loader Integration
    
    The column writers are low-level classes that interface between a consumer and a value
vector. To create the tuple loader we need a way to bind the column writers to the result
set loader. For this, we use a pair of listeners.
    
    * A tuple writer listener handles requests to add a new column at run time.
    * A column listener handles vector overflow.
    
    In both cases, the app works with the low-level accessors, but the listener allows events
to be fed back to the loader which handles the complex details of dynamic schema evolution
and of vector overflow (in fact, it can handle both at the same time.)
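    
    As a rough illustration, the two listeners might have shapes like the following; the actual interfaces in this PR may differ in names and signatures.
    
    ```
    // Hypothetical listener shapes, for illustration only.
    interface TupleWriterListener {
      // Called when the app adds a column at run time via the tuple writer.
      ObjectWriter addColumn(TupleWriter tuple, MaterializedField field);
    }
    
    interface ColumnWriterListener {
      // Called when a column writer detects that its vector has hit the size limit.
      void overflowed(ScalarWriter writer);
    }
    ```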
    
    To handle vector overflow, each “set” method:
    
    * Does a single bounds check to determine if the value is in the range of the current
buffer.
    * If not, relocates the buffer, if possible.
    * If the vector is full (reached the size limit), calls the listener to handle overflow.
    * The listener, via the vector model described earlier, dispatches the request to the
loader.
    * The loader handles overflow, creating a new vector.
    * The writer then writes the value into the new or existing buffer.
    
    ### Vector Overflow Logic
    
    The heart of this abstraction is that last point: the ability to detect when a vector
overflows, switch in a new vector, and continue writing. Several tricks are involved.
    
    Suppose we have a row of five columns: a through e. The code writes a and b. Then, c overflows.
The code can’t rewrite a and b. To handle this, the tuple loader:
    
    * Creates a new, small set of vectors called the “overflow batch”
    * Copies columns a and b from the current batch to the overflow batch.
    * Writes column c to the overflow batch.
    * Allows the client code to finish writing columns d and e (to the overflow batch).
    * Reports to the client that the batch is full.
    
    Note that the client is completely unaware that any of the above occurred: it just writes
a row and asks if it can write another.
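    
    A simplified, hypothetical sketch of the rollover step the loader performs when column “c” overflows mid-row; the `ColumnState` type and its methods are illustrative names, not the actual code.
    
    ```
    // Hedged sketch of the loader-side rollover, assuming a hypothetical ColumnState type.
    void rollover(int overflowRowIndex) {
      for (ColumnState col : columns) {
        col.allocateLookAheadVector();                   // small "overflow batch" vector
        if (col.isWrittenFor(overflowRowIndex)) {
          col.copyRowToLookAhead(overflowRowIndex);      // already-written values (a, b)
        }
        col.bindWriterToLookAhead();                     // later writes (c, d, e) land here
      }
      // The full batch (without the overflow row) is now ready to send downstream;
      // the look-ahead vectors become the start of the next batch.
    }
    ```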
    
    ### Skipping Columns
    
    The loader must also handle a reader, such as Parquet, that skips columns if they are null. There were bugs in Drill’s vectors for this case, and temporary patches were made in a number of places to make this work. The trick also should work for arrays (a null array is allowed; Drill represents it as an empty array), but this code also was broken. For good measure, the code now also allows skipping non-null columns if a good “empty” value is available: 0 for numbers, blank for strings. This behavior is needed for the CSV reader; if a line is missing a field, the CSV reader treats it as an empty (not null) field.
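    
    From the client's point of view, skipping a column is just not setting it, as in the hedged sketch below (column name "b" is illustrative); on `save()`, the loader back-fills unset columns with null for nullable columns or an "empty" value for required ones.
    
    ```
    // Hedged sketch of a sparse row write.
    void writeSparseRow(RowSetLoader writer) {
      writer.start();
      writer.scalar("a").setInt(10);
      // column "b" intentionally not set for this row
      writer.scalar("c").setString("foo");
      writer.save();   // "b" is filled as null or empty, depending on its mode
    }
    ```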
    
    ### Result Set Schema
    
    The tuple loader is designed to handle two kinds of tables: “early schema” tables (such as Parquet and CSV) that define the schema up front, and “late schema” tables (such as JSON) that discover the schema during reading. The tuple loader allows either form and, in fact, uses the same mechanism for both.
    
    Consumers of batches will, of course, want to know that the schema changed. Providing a simple flag is muddy: when should it be reset? A better solution is to provide a schema version which is incremented each time a column is added. (Columns cannot be removed or changed, at least not yet.)
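    
    A hedged sketch of how a consumer might use the version; the `schemaVersion()` accessor is an assumption used for illustration.
    
    ```
    // Sketch: detect schema growth by comparing version numbers per batch.
    int lastKnownVersion = 0;
    
    void onBatch(ResultSetLoader loader) {
      int version = loader.schemaVersion();
      if (version > lastKnownVersion) {
        // One or more columns were added since the last batch; adjust downstream
        // structures (transfers, projections) before consuming the new batch.
        lastKnownVersion = version;
      }
    }
    ```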
    
    ### Internal Plumbing
    
    The loader assembles the many pieces described earlier:
    
    * The vectors themselves
    * The writers (tuples, objects, arrays, scalars)
    * The vector model (tree)
    * Tuple and column metadata
    * The loader logic for handling overflow, etc.
    
    The solution is based on a strict separation of concerns via layering and loose coupling.
    
    * The vector model is the “backbone” that coordinates the other pieces.
    * Metadata provides instructions for building the vector model.
    * The vector model builds the corresponding vector and writer.
    * The writer listener sends events (see above) to the vector model.
    * The vector model dispatches events to the loader via a “coordinator” interface.
    * The loader performs batch, row and column operations via the vector model.
    
    Please see the extensive Javadoc description in the `org.apache.drill.exec.physical.rowSet`
package, especially in the `package-info.java` file in the `impl` sub-package.
    
    ### Result Vector Cache
    
    Above we mentioned that the tuple loader allows schema changes on the fly. As the next
PR will make more clear, downstream operators want a fixed set of vectors. To assist with
this, the tuple loader uses a “result vector cache”. Let’s say a scanner reads two JSON
files with the same schema. The first creates the schema and vectors. The second is obligated
to use the same vectors. This is a royal pain. But, the vector cache does it automatically:
when the tuple loader adds a new column, it checks if the vector already exists in the cache
and reuses it. If not there, the cache adds it and returns it so that it is there the next
time around.
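    
    A hedged sketch of the cache lookup when the loader adds a column; the cache type and its `addOrGet()` method are illustrative names chosen to mirror Drill’s container idiom, not necessarily the exact API in this PR.
    
    ```
    // Sketch under stated assumptions.
    ValueVector vectorFor(ResultVectorCache cache, MaterializedField field) {
      // Same field across files -> same vector instance, so downstream operators see a
      // stable set of vectors; a new field creates, caches, and returns a new vector.
      return cache.addOrGet(field);
    }
    ```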
    
    ### Unit Tests
    
    All of the above is pretty thoroughly tested via unit tests. In fact, the unit tests are
a good place to start (now I tell you!) in order to see how client code uses the various abstractions.
    
    The bit of unit test structure that handled system options turned out to be wrong. Modified
it to use the defaults defined in the system option manager, which required changing the visibility
of the defaults table.
    
    ## Commit 5: Collateral Damage
    
    The last commit contains various other changes, mostly reflecting the changes above.
    
    Some unit tests were updated to use new features that become available in this PR. See `TestFillEmpties` and `TestVectorLimits`.
    
    The `equals()` method in BatchSchema is badly broken. Cleaned it up some, but didn’t want to change it too much in case anything depends on the current, broken semantics. So, added a new `isEquivalent()` method to provide the correct semantics. Added an `isEquivalent()` method to `MaterializedField` as well that ignores the “implementation” columns that hang off of types such as nullables, repeated, etc. That is, two repeated columns are identical if their type is identical, regardless of whether one has the “$offsets” child or not.
    
    ## Open Issues
    
    This PR provides most of the required functionality. However, a few bits and pieces are
still in progress and will appear in subsequent commits and/or PRs:
    
    * Full testing of repeated maps
    * Support for lists and repeated lists (will reuse structure from repeated maps)
    * Column projection (was in an earlier version, new version needed for this structure)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/paul-rogers/drill DRILL-5657-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/914.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #914
    
----
commit 68c0520492fc053eb74a9f02e9fa3f01be53ab83
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-18T05:41:30Z

    DRILL-5657: Size-aware vector writer structure
    
    Commit 1: The vector and accessor layer

commit d2f7b2050764b8d4630158cd34dde0291377d3d7
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-18T05:43:54Z

    Commit 2: Row Set layer
    
    Includes the test-oriented row set abstractions as well as unit tests
    for the accessor layer

commit 7aff7a791f786db24853752c91031ee9121c8de3
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-18T05:47:10Z

    Commit 3: The tuple and column models
    
    Also includes the implementation of the tuple and column metadata.

commit 4ef94ffd15f0570398a4ab39c9bffc77bc777473
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-18T05:48:00Z

    Commit 4: The "result set loader" layer

commit cf2973278bc7aa29a5cf28fd320f84b08bd49743
Author: Paul Rogers <progers@maprtech.com>
Date:   2017-08-18T05:49:13Z

    Commit 5: Collateral damage
    
    Includes tests that had to change when the row set interface changed
    and a number of test cleanup items.

----


