drill-dev mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5275) Sort spill serialization is very slow
Date Sun, 19 Feb 2017 20:02:44 GMT
Paul Rogers created DRILL-5275:
----------------------------------

             Summary: Sort spill serialization is very slow
                 Key: DRILL-5275
                 URL: https://issues.apache.org/jira/browse/DRILL-5275
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
             Fix For: 1.10.0


Drill provides a sort operator that spills to disk. The spill and read operations use the
serialization code in {{VectorAccessibleSerializable}}. This code, in turn, uses the {{DrillBuf.getBytes()}}
method to write to an output stream. (Yes, the "get" method writes, and the "write" method
reads...)

The {{DrillBuf}} method in turn calls the UDLE ({{UnsafeDirectLittleEndian}}) method, which does:

{code}
            byte[] tmp = new byte[length];
            PlatformDependent.copyMemory(addr(index), tmp, 0, length);
            out.write(tmp);
{code}

That is, each write allocates a new heap buffer as large as the data being written. Since
Drill buffers can be quite large (4, 8, 16 MB or larger), this rapidly fills the heap and
triggers frequent garbage collection.

The result is slow performance. On a Mac with an SSD that can do 700 MB/s of I/O, we get
only about 40 MB/s, very likely because of excessive CPU cost and GC.

The solution is to allocate a single read or write buffer, then reuse that same buffer
for every read or write. This must be done in {{VectorAccessibleSerializable}}, as
it is a per-thread class that has visibility into all the buffers to be written.
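
A minimal sketch of the buffer-reuse idea, not the actual Drill change. The class name {{ReusableSpillBuffer}} and its methods are hypothetical, and a direct {{java.nio.ByteBuffer}} stands in for {{DrillBuf}}. The point is only that one heap array, allocated once, is enough to move off-heap data of any size to or from a stream in chunks:

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not a Drill class): moves bytes between a direct
 * buffer and a stream through one heap array that is allocated once.
 */
final class ReusableSpillBuffer {
    // Single scratch array, reused for every read and write.
    private final byte[] scratch;

    ReusableSpillBuffer(int scratchSize) {
        if (scratchSize <= 0) {
            throw new IllegalArgumentException("scratchSize must be positive");
        }
        this.scratch = new byte[scratchSize];
    }

    /** Writes the remaining bytes of src to out without allocating per call. */
    void write(ByteBuffer src, OutputStream out) throws IOException {
        // Work on a duplicate so the caller's position and limit are untouched.
        ByteBuffer view = src.duplicate();
        while (view.hasRemaining()) {
            int n = Math.min(view.remaining(), scratch.length);
            view.get(scratch, 0, n);   // off-heap -> scratch
            out.write(scratch, 0, n);  // scratch -> stream
        }
    }

    /** Reads exactly length bytes from in into dst, again via the scratch array. */
    void read(InputStream in, ByteBuffer dst, int length) throws IOException {
        int remaining = length;
        while (remaining > 0) {
            int n = in.read(scratch, 0, Math.min(remaining, scratch.length));
            if (n < 0) {
                throw new EOFException("stream ended with " + remaining + " bytes still expected");
            }
            dst.put(scratch, 0, n);    // scratch -> off-heap
            remaining -= n;
        }
    }
}
```

With one such instance per serializing thread, heap use stays at the scratch size no matter how large the vectors are. The scratch array is not safe to share across threads, which is consistent with keeping it in a per-thread class.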



