drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5013) Heap allocation, data copies in UDLE write path for ExternalSortBatch
Date Tue, 08 Nov 2016 18:08:58 GMT
Paul Rogers created DRILL-5013:

             Summary: Heap allocation, data copies in UDLE write path for ExternalSortBatch
                 Key: DRILL-5013
                 URL: https://issues.apache.org/jira/browse/DRILL-5013
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.8.0
            Reporter: Paul Rogers
            Priority: Minor

The ExternalSortBatch (ESB) uses spill-to-disk to sort a large collection of records within
a limited memory footprint.

As part of writing data to disk, ESB writes the byte buffer behind each vector to an output
stream. Since the vector is stored in direct memory (which an output stream cannot read
directly), the code path first makes a temporary on-heap copy.

In particular, the code in `io.netty.buffer.PooledUnsafeDirectByteBuf` does the following:

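    public ByteBuf getBytes(int index, OutputStream out, int length) throws IOException {
        checkIndex(index, length);
        if (length != 0) {
            // A fresh heap array is allocated on every call.
            byte[] tmp = new byte[length];
            // Copy 1: direct memory -> tmp.
            PlatformDependent.copyMemory(addr(index), tmp, 0, length);
            // Copy 2: tmp -> the output stream.
            out.write(tmp);
        }
        return this;
    }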

The result is that we 1) create a large number of short-lived on-heap arrays (one per write),
and 2) copy the data twice: once from direct memory to the tmp buffer, and again from the tmp
buffer into the output stream's own buffer.

Two optimizations are possible:

1. Copy the data byte-by-byte from the direct memory buffer to the output stream, or
2. Reuse the same tmp buffer across vector writes.

Since the code is in Netty, if we do either of the above, we'd have to write our own version of
the "getBytes" method (a misnomer; it really writes bytes).
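
A minimal sketch of option 2, for illustration only. It uses a plain `java.nio.ByteBuffer` as a stand-in for the Netty/Drill buffer so that it is self-contained; the class name `ReusableBufferWriter`, its `write` method, and the chunk size are hypothetical and do not exist in Drill or Netty. The idea is to allocate the tmp array once and drain the direct buffer through it in chunks, so no per-write heap allocation occurs.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/** Hypothetical helper (not a Drill or Netty class): reuses one heap array across writes. */
final class ReusableBufferWriter {
    // Allocated once and reused for every write; not thread-safe.
    private final byte[] tmp;

    ReusableBufferWriter(int chunkSize) {
        this.tmp = new byte[chunkSize];
    }

    /** Writes src[index, index + length) to out without allocating per call. */
    void write(ByteBuffer src, int index, int length, OutputStream out) throws IOException {
        // duplicate() shares the memory but has its own position and limit,
        // so the caller's buffer state is left untouched.
        ByteBuffer view = src.duplicate();
        view.clear();
        view.limit(index + length);
        view.position(index);
        while (view.hasRemaining()) {
            int n = Math.min(tmp.length, view.remaining());
            view.get(tmp, 0, n);   // direct memory -> reused heap array
            out.write(tmp, 0, n);  // heap array -> output stream
        }
    }
}
```

This still copies the data twice, but it removes the per-write allocation. With a Netty `ByteBuf`, the same loop could presumably be built on `getBytes(int index, byte[] dst, int dstIndex, int length)` in place of the `ByteBuffer` calls; that substitution is an assumption and is not tested here.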

This message was sent by Atlassian JIRA