drill-dev mailing list archives

From Paul Rogers <prog...@mapr.com>
Subject Addressing memory fragmentation in Drill
Date Tue, 13 Jun 2017 20:28:25 GMT
Hi All,

Those of you who were able to join the Drill Hangout today got a brief introduction to the
memory fragmentation issue we wish to resolve. For everyone else, below is a very brief overview
of the issue. Please consult the documents in DRILL-5211 for more information. Since we are
proposing a number of changes, it would be great to get many eyes looking at both the problem
and solution.

Drill uses a two-tier memory allocator. Netty handles allocations up to 16 MB; Java Unsafe
handles allocations of 32 MB or larger directly from native memory. (All allocations are done
in power-of-two sizes.) When freeing blocks, 16 MB and smaller blocks go onto the Netty free
list, 32 MB and larger blocks go back to the native memory pool. Eventually, all memory is
free in the form of 16 MB blocks sitting on Netty’s free list. Now suppose we attempt a 32 MB
allocation: it fails, because all native memory is held by Netty. The result is many GB of
free memory, yet an OOM error.
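To make the failure mode concrete, here is a minimal, purely illustrative sketch (all names hypothetical, sizes scaled down) of a two-tier allocator: freed small blocks stay cached on a Netty-style free list rather than returning to the native pool, so a large request can fail even though plenty of memory is technically free.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the two-tier allocator described above. Not Drill code.
public class FragmentationDemo {
    static final long MB = 1024 * 1024;

    long nativeFree;                                      // un-handed-out native memory
    final Deque<Long> nettyFreeList = new ArrayDeque<>(); // cached small blocks

    FragmentationDemo(long nativeBytes) { this.nativeFree = nativeBytes; }

    // Requests of 16 MB or less can be served from the cache; larger
    // requests must come directly from the native pool.
    boolean allocate(long size) {
        if (size <= 16 * MB && !nettyFreeList.isEmpty()) {
            nettyFreeList.pop();
            return true;
        }
        if (nativeFree >= size) {
            nativeFree -= size;
            return true;
        }
        return false; // OOM, even though the cache may hold free blocks
    }

    // Freed small blocks go back to the cache, never to the native pool.
    void free(long size) {
        if (size <= 16 * MB) nettyFreeList.push(size);
        else nativeFree += size;
    }

    public static void main(String[] args) {
        FragmentationDemo pool = new FragmentationDemo(64 * MB);
        // Carve all native memory into 16 MB blocks, then free them all.
        for (int i = 0; i < 4; i++) pool.allocate(16 * MB);
        for (int i = 0; i < 4; i++) pool.free(16 * MB);
        // All 64 MB is "free", but only on the small-block free list.
        System.out.println("32 MB alloc succeeds? " + pool.allocate(32 * MB));
    }
}
```

Running this prints `32 MB alloc succeeds? false`: the same shape of failure, scaled down, as the GB-sized OOM described above.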

Many solutions are possible: extend the Netty block size, force Netty to release memory back
to the native pool, and so on. However, the Netty allocator turns out to be roughly 1000x faster
than the native allocator, so we would prefer to use it for most allocations, which rules out
many of the possible solutions.

Therefore, we have found that our best path forward is to limit individual value vectors to
16 MB in size. Various low-level changes enable this limit. (See PR 840, DRILL-5517.) On top
of that, we created a modified version of the “vector writers” that are size aware. Finally,
we created a new scan “mutator” that handles limits. (This structure follows the existing
structures already in Drill.)
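The core of a size-aware writer is a check before each write; for the details, see the specs in DRILL-5211. As a rough, hypothetical sketch (not the actual writer API), the idea looks like this:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a size-aware vector writer: it refuses any write
// that would push the underlying vector past the 16 MB cap, signaling
// overflow to the caller instead of letting the vector grow unbounded.
public class SizeAwareWriter {
    static final int VECTOR_LIMIT = 16 * 1024 * 1024; // 16 MB cap per vector

    private final ByteBuffer buf = ByteBuffer.allocate(VECTOR_LIMIT);
    private int valueCount;

    // Returns false (overflow) if the write would exceed the vector limit.
    boolean writeLong(long value) {
        if (buf.remaining() < Long.BYTES) return false;
        buf.putLong(value);
        valueCount++;
        return true;
    }

    int valueCount() { return valueCount; }

    public static void main(String[] args) {
        SizeAwareWriter w = new SizeAwareWriter();
        int written = 0;
        while (w.writeLong(written)) written++;
        System.out.println("Vector holds " + written + " longs");
    }
}
```

The real writers are more elaborate (variable-width types, nullable vectors, incremental buffer growth), but the contract is the same: the caller learns about overflow at the point of the write, which is what makes the overflow-row handling below possible.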

In the general case, a scanner can read its data only once. So, how do we handle the case where
we have 20 columns, have copied the first 10 into vectors, and the 11th column overflows?
The new mutator implements an “overflow row” by creating a new, “look ahead” batch,
moving the partially-written overflow row to the new batch, and letting the reader complete
adding columns to the overflow row. The reader then sends the full batch downstream. On the
next call to read a batch, reading starts with the first row already in the new batch.
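The overflow-row mechanism can be sketched in miniature (hypothetical names, a per-batch value budget standing in for the 16 MB vector cap; the real mutator is described in DRILL-5211):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the "overflow row": rows are written column by column, and
// when a column write would overflow the current batch, the completed rows
// are shipped downstream while the partially written row carries over into
// a fresh look-ahead batch, where the reader finishes filling it in.
public class OverflowMutator {
    final int batchCapacity;               // max values per batch (stand-in for 16 MB)
    List<int[]> batch = new ArrayList<>(); // completed rows in the current batch
    final List<List<int[]>> sentDownstream = new ArrayList<>();
    private final int[] pendingRow;        // row currently being filled
    private int pendingCols;
    private int valuesInBatch;

    OverflowMutator(int batchCapacity, int columns) {
        this.batchCapacity = batchCapacity;
        this.pendingRow = new int[columns];
    }

    void writeColumn(int value) {
        if (valuesInBatch + 1 > batchCapacity) {
            // Overflow: ship the full rows, start a look-ahead batch, and
            // count the partial row's values against the new batch.
            sentDownstream.add(batch);
            batch = new ArrayList<>();
            valuesInBatch = pendingCols;
        }
        pendingRow[pendingCols++] = value;
        valuesInBatch++;
    }

    void finishRow() {
        batch.add(pendingRow.clone());
        pendingCols = 0;
    }

    public static void main(String[] args) {
        // 3 columns, room for 7 values per batch: the 3rd row overflows
        // mid-row and completes in the look-ahead batch.
        OverflowMutator m = new OverflowMutator(7, 3);
        for (int r = 0; r < 3; r++) {
            for (int c = 0; c < 3; c++) m.writeColumn(r * 10 + c);
            m.finishRow();
        }
        System.out.println("batches sent: " + m.sentDownstream.size()
            + ", rows carried over: " + m.batch.size());
    }
}
```

The reader never sees any of this: it just keeps calling `writeColumn`, and the mutator decides when a batch is full and where the in-flight row lands.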

This change allows readers to handle vector limits transparently. But each reader has implemented
vector writing in its own way: some use the vector writers, Parquet has its own vector writers,
some write to vectors (without the vector writers), and some bypass vectors entirely to write
directly into the underlying direct memory. So, we need to standardize on a single size-aware
mechanism. Plus, readers need to handle “missing” columns, implicit and partition columns,
etc. This common logic should also be standardized.

The resulting refactoring leaves Drill readers with only the task of reading data from a data
source and loading data into vectors using the new vector writers. A nice side effect of this
change is that readers become very simple, easy to write, and easy to test, which, in turn,
should encourage more people to contribute storage plugins.

Specs for all of the above are posted to DRILL-5211. Please review at your convenience. Working
code also exists; PRs will be issued one after another, since each depends on code in the
previous one. The specs point to my working branch for those who want an early peek at the
code without waiting for the PRs.

Once the readers limit vector sizes, we’ll need a solution for other operators, such as flatten
and project, that can potentially create large vectors. That is an open topic for which we have
only a very general outline of a solution.

An advantage of the vector-size-limit approach is that we can extend it to limit batch size
to improve Drill’s ability to manage memory. For example, receivers must accept three incoming
batches before back pressure kicks in. But because batches are currently of unlimited size,
receivers cannot know how much memory to allocate to buffer the required three batches. A
standard batch size will resolve this issue, among others.
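The receiver-side arithmetic is simple once a batch cap exists; the following back-of-the-envelope sketch uses the three-batch figure from the text, while the sender count and batch limit are purely illustrative:

```java
// With a fixed batch-size limit, a receiver can budget its buffer memory
// up front: senders * 3 in-flight batches * batch limit.
public class ReceiverBudget {
    static long requiredBytes(int senders, long batchLimitBytes) {
        return senders * 3L * batchLimitBytes;
    }

    public static void main(String[] args) {
        long limit = 16L * 1024 * 1024; // hypothetical 16 MB batch cap
        System.out.println("Buffer needed for 10 senders: "
            + requiredBytes(10, limit) / (1024 * 1024) + " MB");
    }
}
```

With unlimited batch sizes, no such bound can be computed at all, which is exactly the problem.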

The above is the proposal in a nutshell. Please consult the documents for details. To help
us track your comments, please post comments to DRILL-5211 instead of replying here.


- Paul
