drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Rogers <prog...@maprtech.com>
Subject Sort performance improvement
Date Mon, 02 Jan 2017 01:11:56 GMT
Hi All,

Happy New Year!

Recent experiments show that we can more than double in-memory sort performance by generating
tighter code in the “inner loops” of the sort and merge implementations. The improved
sort performance causes a 50% reduction in overall runtime (doubling of performance) for the
sample query: 10 million rows of randomly-generated integers. YMMV.

The details are in [1]. In short, generated code makes expensive use of layer upon layer of
function calls to fetch each value. The speed-up comes from eliminating 13 function calls
per compare operation. In the sort we do something like O(n log n) compares: so savings add
up quickly. For this test, the reduction is something like ~1.5 billion function calls.

Existing:

compare()
. SelectionVector2.getIndex( ) // x 2
. . DrillBuf.getChar( )
. . . DrillBuf.getShort()
. . . . PlatformDependent.getShort()
. doEval( ) // Generated, this is for required int
. . IntVector.getAccessor() // x 2
. . . Accessor.get()
. . . . DrillBuf.getInt()
. . . . . PlatformDependent.getInt()

Revised:

    public int compare(int leftIndex, int rightIndex) {
      int sv1 = PlatformDependent.getShort(addrSv + (leftIndex << 1)) & 0xFFFF;
      int sv2 = PlatformDependent.getShort(addrSv + (rightIndex << 1)) & 0xFFFF;
      int left = PlatformDependent.getInt(addr0 + (sv1<<2));
      int right = PlatformDependent.getInt(addr4 + (sv2<<2));

A similar change was made in the “MSorter”: the code that merges sorted batches in memory.

Interestingly, no additional performance gain was seen in replacing the PlatformDependent
calls with direct calls to Unsafe. Presumably that path is simple enough to be optimized away
by the JVM. 

This experiment hand-coded the replacement “SingleBatchSorter” and “MSorter". A next
step would be to revise the code generator to generate the required code. And, of course,
nullable, variable-length and array vectors will be more complex than the simple required
int vector used in the experiment.

This technique can be used anywhere we have compute-intensive inner loops in generated code.

- Paul

[1] https://github.com/paul-rogers/drill/wiki/Optimization-of-External-Sort-Performance


Mime
View raw message