drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5275) Sort spill serialization is slow due to repeated buffer allocations
Date Tue, 28 Mar 2017 20:18:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15945873#comment-15945873 ]

Paul Rogers commented on DRILL-5275:

Primarily a development issue; hard to test at the QA level.

> Sort spill serialization is slow due to repeated buffer allocations
> -------------------------------------------------------------------
>                 Key: DRILL-5275
>                 URL: https://issues.apache.org/jira/browse/DRILL-5275
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>              Labels: ready-to-commit
>             Fix For: 1.10.0
> Drill provides a sort operator that spills to disk. The spill and read operations use the serialization code in {{VectorAccessibleSerializable}}. This code, in turn, uses the {{DrillBuf.getBytes()}} method to write to an output stream. (Yes, the "get" method writes, and the "write" method reads...)
> The {{DrillBuf}} method in turn calls the UDLE ({{UnsafeDirectLittleEndian}}) method, which does:
> {code}
>             byte[] tmp = new byte[length];
>             PlatformDependent.copyMemory(addr(index), tmp, 0, length);
>             out.write(tmp);
> {code}
> That is, each write allocates a new heap buffer. Since Drill buffers can be quite large (4, 8, 16 MB or larger), this rapidly fills the heap and triggers frequent GC.
> The result is slow performance. On a Mac with an SSD that can do 700 MB/s of I/O, we get only about 40 MB/s, very likely because of excessive CPU cost and GC.
> The solution is to allocate a single read or write buffer, then reuse that same buffer when reading or writing. This must be done in {{VectorAccessibleSerializable}}, as it is a per-thread class that has visibility to all the buffers to be written.
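
The buffer-reuse idea described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Drill patch: a direct {{java.nio.ByteBuffer}} stands in for a {{DrillBuf}}, and the class name {{ReusableBufferWriter}}, its constructor parameter, and the 32 KB scratch size are all hypothetical. The point is that one heap array is allocated once and reused for every chunk, instead of allocating a {{new byte[length]}} per write.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Drill: copies off-heap data to a stream
// through one reusable heap buffer instead of a new byte[] per write.
class ReusableBufferWriter {
  // Allocated once and reused for every write; not safe to share across threads.
  private final byte[] scratch;

  ReusableBufferWriter(int scratchSize) {
    if (scratchSize <= 0) {
      throw new IllegalArgumentException("scratchSize must be positive");
    }
    this.scratch = new byte[scratchSize];
  }

  // Writes all remaining bytes of src to out in scratch-sized chunks.
  void write(ByteBuffer src, OutputStream out) throws IOException {
    // Work on a duplicate so the caller's position and limit are untouched.
    ByteBuffer view = src.duplicate();
    while (view.hasRemaining()) {
      int n = Math.min(view.remaining(), scratch.length);
      view.get(scratch, 0, n);   // off-heap -> reusable heap buffer
      out.write(scratch, 0, n);  // heap buffer -> stream
    }
  }

  public static void main(String[] args) throws IOException {
    // Stand-in for a large Drill buffer: 1 MB of direct (off-heap) memory.
    ByteBuffer direct = ByteBuffer.allocateDirect(1 << 20);
    for (int i = 0; i < direct.capacity(); i++) {
      direct.put((byte) i);
    }
    direct.flip();

    ReusableBufferWriter writer = new ReusableBufferWriter(32 * 1024);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writer.write(direct, out);
    System.out.println("bytes written: " + out.size());
  }
}
```

Because the scratch array is mutable shared state, a sketch like this needs one writer instance per thread, which fits the note above that the buffer should live in a per-thread class.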

This message was sent by Atlassian JIRA