drill-dev mailing list archives

From Paul Rogers <prog...@mapr.com>
Subject "Batch size control" and unit testing
Date Mon, 13 Nov 2017 23:24:41 GMT
Hi All,

Here is the next installment in the “batch size control” project update.

Drill has a great many operators. As we move forward, we must update them to use the new batch
size control framework. Unit testing becomes a major concern. This note explains how we address
that issue in this project.

The “classic” way to test Drill is to build the product, fire up the Drill server, and
use Sqlline to fire off queries. The problem, of course, is that the edit-compile-debug cycle
is glacially slow (five minutes). Testing is also manual: copy and paste the query into
Sqlline, then visually inspect the results.

Another alternative is to run the very same query, but as a JUnit test. Drill has many such
tests. The “BaseTestQuery” framework and “TestBuilder” help. The newish “Cluster
Framework” makes it very easy to start an embedded Drillbit with the desired options and
settings, run a query, and examine the results. The edit-compile-debug cycle is much faster,
on the order of 10-20 seconds.

This is good, but we still run the entire Drill operator stack and throw queries at it. We
use a file for input and capture query results as output. But we want much finer-grained
testing. That is, we want true unit testing: isolate a component, feed it some input, and
verify its output.

A fact of Drill is that operators are tightly coupled with the fragment context, which is
coupled with the Drillbit context, which in turn needs the entire server. What to do? One
solution is to use mocks, and, indeed, Drill has three solutions based on JMockit, Mockito,
and Jinfeng’s handy new “Mini-Plan” framework.

Mocks are handy, but it is cleaner and simpler to have code that can be tested in isolation
without mocks. The next step is the “sub-operator” test framework, the “RowSet” utilities
and the “context” refactoring that break the tight coupling with the rest of Drill, allowing
us to separate out an operator (after some simple changes to the code) to test in isolation.
We can now easily pump in a very large variety of inputs (such as Drill’s 30+ data types
in the 3 cardinalities) without having to set up a lot of overhead for each.

Still, many operators are internally complex, and poking at them from the outside is
limiting. We want to test, say, not just the sort operator as a whole, but also the bit of
code that does the in-memory sort, or the one that writes batches to disk. To do this, we
must “disaggregate” each operator into a series of separately testable components, each
with a clear API.
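
As a rough illustration of what such disaggregation might look like (this is not Drill's
sort code; the class names InMemorySorter and BatchSpiller and the int[] "batch" are made up
for the example), each piece below can be exercised directly from a test:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical components for illustration only; not Drill classes.

/** Sorts a single in-memory batch. No disk or server dependencies. */
final class InMemorySorter {
    int[] sort(int[] batch) {
        int[] copy = batch.clone();   // leave the caller's batch untouched
        Arrays.sort(copy);
        return copy;
    }
}

/** Writes a batch to a temporary file and reads it back. */
final class BatchSpiller {
    Path spill(int[] batch) {
        try {
            Path file = Files.createTempFile("spill", ".bin");
            file.toFile().deleteOnExit();
            try (DataOutputStream out =
                     new DataOutputStream(Files.newOutputStream(file))) {
                out.writeInt(batch.length);      // length prefix
                for (int value : batch) {
                    out.writeInt(value);
                }
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int[] read(Path file) {
        try (DataInputStream in =
                 new DataInputStream(Files.newInputStream(file))) {
            int[] batch = new int[in.readInt()];
            for (int i = 0; i < batch.length; i++) {
                batch[i] = in.readInt();
            }
            return batch;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A test can then check the sorter with nothing but arrays, and check the spiller with a
write/read round trip, without starting a Drillbit or building a query.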

We can refactor in this way only when writing a new operator, or when we need to make major
changes to an existing one. As part of the “batch size control” project, we have created
a new version of the scan operator using this model.

Refactoring scan pointed out an opportunity to refactor the core operator code itself. Each
operator has three responsibilities:

* Implement the Drill iterator protocol.
* Hold a record batch.
* Implement the details of the operator algorithm.

The next “batch size” PR will provide a new version of the base operator class that splits
these responsibilities into classes for the first two items and an interface for the third.
This allows us to unit test the two classes once and for all. For each operator, testing can
then focus on just the operator implementation.
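
To make the split concrete, here is a minimal sketch of the shape it could take. Everything
here is an assumption for illustration: the names OperatorDriver, BatchHolder and
OperatorAlgorithm, the two-value Outcome enum, and the List<Integer> "batch" are invented
and greatly simplified; they are not the classes in the upcoming PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; a simplified sketch, not Drill's API.

/** Third responsibility: the operator algorithm, as an interface. */
interface OperatorAlgorithm {
    /** Adds the next batch of rows to out; returns false when exhausted. */
    boolean nextBatch(List<Integer> out);
    void close();
}

/** Second responsibility: hold the current batch. */
final class BatchHolder {
    private final List<Integer> rows = new ArrayList<>();
    List<Integer> rows() { return rows; }
    void clear() { rows.clear(); }
}

/** First responsibility: a (much simplified) iterator protocol. */
final class OperatorDriver {
    enum Outcome { OK, NONE }

    private final OperatorAlgorithm algorithm;
    private final BatchHolder holder = new BatchHolder();
    private boolean done;

    OperatorDriver(OperatorAlgorithm algorithm) {
        this.algorithm = algorithm;
    }

    Outcome next() {
        if (done) {
            return Outcome.NONE;
        }
        holder.clear();
        if (algorithm.nextBatch(holder.rows())) {
            return Outcome.OK;
        }
        done = true;          // close the algorithm exactly once
        algorithm.close();
        return Outcome.NONE;
    }

    List<Integer> batch() { return holder.rows(); }
}

/** A toy algorithm: emits 1..limit in batches of at most batchSize rows. */
final class CountingAlgorithm implements OperatorAlgorithm {
    private final int limit;
    private final int batchSize;
    private int nextValue = 1;
    boolean closed;

    CountingAlgorithm(int limit, int batchSize) {
        this.limit = limit;
        this.batchSize = batchSize;
    }

    @Override
    public boolean nextBatch(List<Integer> out) {
        while (nextValue <= limit && out.size() < batchSize) {
            out.add(nextValue++);
        }
        return !out.isEmpty();
    }

    @Override
    public void close() { closed = true; }
}
```

In a layout like this, the driver and holder are tested once against a trivial algorithm
such as CountingAlgorithm, and each real algorithm is tested by calling nextBatch()
directly, with no driver and no mocks.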

The core operator algorithm implementation is designed to be loosely coupled to the rest of
Drill, allowing complete unit testing without mocks. The scan operator revision, which we’ll
describe in the next note, makes use of this structure.


- Paul
