orc-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: ORC double encoding optimization proposal
Date Mon, 19 Mar 2018 22:45:03 GMT
> existing work [1] from Teddy Choi and Owen O'Malley with some new compression codecs
> (e.g. ZSTD and Brotli), we proposed to promote FLIP as the default encoding for the ORC
> double type to move this feature forward.

Since we're discussing these, I'm going to summarize my existing notes on this before we go further.

FLIP & SPLIT are the two best algorithms from different ends of the spectrum & they
have their own strengths.

FLIP was designed with Intel C++ code in mind; the Java implementation is somewhat slower
today, but the C++ impl should be very fast.

In an ideal world, the entire FLIP should unroll into a single instruction - PSHUFB (AVX512
will be able to unpack an 8x8x8 matrix; this is common on many platforms due to the similarity
to RGBA data transforms).

At some point, we'll be able to rewrite it using Long8 native types (assuming JIT can understand
a shuffle op).
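To make the shuffle concrete, here's a minimal scalar sketch of what I mean - assuming FLIP is the byte-transposition layout, where the 8 bytes of each double are scattered into 8 separate byte planes (class and method names here are made up for illustration; this scalar loop is exactly what a PSHUFB-style shuffle would collapse into):

```java
// Sketch of FLIP as a byte transposition (illustrative, not the ORC code).
// The n doubles form an n x 8 byte matrix; FLIP stores it column-major so
// each output "plane" holds the same byte position of every value.
public class FlipSketch {
    // Transpose n doubles into 8 byte planes of n bytes each.
    public static byte[] flip(double[] values) {
        int n = values.length;
        byte[] out = new byte[n * 8];
        for (int i = 0; i < n; i++) {
            long bits = Double.doubleToRawLongBits(values[i]);
            for (int b = 0; b < 8; b++) {
                // plane b holds byte b of every value
                out[b * n + i] = (byte) (bits >>> (b * 8));
            }
        }
        return out;
    }

    // Inverse: reassemble the doubles from the byte planes.
    public static double[] unflip(byte[] planes, int n) {
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            long bits = 0;
            for (int b = 0; b < 8; b++) {
                bits |= (planes[b * n + i] & 0xFFL) << (b * 8);
            }
            out[i] = Double.longBitsToDouble(bits);
        }
        return out;
    }
}
```

The inner loops are branch-free bit arithmetic, which is why a vector shuffle (or a JIT that understands one) can do the whole transpose in a handful of instructions.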


Here are the tools to run your own data through to determine whether FLIP will work for you (the
byteuniq UDF).
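The idea behind that check can be sketched as follows - this is only in the spirit of the byteuniq UDF, not its actual code: count distinct byte values at each of the 8 byte positions of your doubles. Planes with few distinct bytes are what make FLIP output compress well.

```java
// Illustrative byteuniq-style analysis (not the real UDF): for each of
// the 8 byte positions of a double, count how many distinct byte values
// occur across the dataset. Low counts => that byte plane is skewed and
// will compress well after FLIP.
public class ByteUniqSketch {
    public static int[] distinctBytesPerPlane(double[] values) {
        boolean[][] seen = new boolean[8][256];
        int[] counts = new int[8];
        for (double v : values) {
            long bits = Double.doubleToRawLongBits(v);
            for (int b = 0; b < 8; b++) {
                int by = (int) ((bits >>> (b * 8)) & 0xFF);
                if (!seen[b][by]) {
                    seen[b][by] = true;
                    counts[b]++;
                }
            }
        }
        return counts;
    }
}
```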


I haven't run HEPMASS through that script, but you can see that the bit level has even neater entropy
skews than the whole byte, while byte packing will still offer enough dictionary items.


This shows how the LZ77 in Zlib mostly picks matches by detecting the 7-byte patterns instead
of the 8-byte patterns, which are definitely common enough (we could have much tighter
symbol detection in LZ77, though I'm more interested in poking at the Zstd search depth).

There are more disk savings that can come out of FLIP.

> It's compression friendly unlike Split and FPC. 

SPLIT is very memory-bandwidth friendly and is probably the best format to cache in memory,
because it doesn't explode in size when Zlib-decompressed into a buffer.

SPLIT+LLAP cache is likely to be faster than FLIP+LLAP cache, purely from the memory bandwidth
needs of the loops & the cache overflow rate of FLIP (hitting the cache saves the entire
Zlib CPU cost, which is roughly 30%).

The core perf issue with the SPLIT algorithm is that it doesn't decompose neatly at the bit level
in the Java memory model - the current loop has a lot of room for improvement.

Basically, right now there are at least 3 branches in the SPLIT decode and 1 in FLIP - nextNumber()
is basically assembling values 1 at a time, instead of 8 at a time.

Purely speaking from a decode loop perspective, we have a lot of performance improvements
to be made in the SPLIT algorithm which are pending - that should ideally come as part of
RLEv3 & indirectly make the SPLIT reader faster.

With an 8x unrolled impl, SPLIT is going to catch up in total decode rate & I started
ORC-187 after digging into some of those branching loops.
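The shape of the change is something like this - a made-up sketch, not the ORC reader code, just to show the difference between a nextNumber()-style loop and an 8x unrolled body that the JIT can keep in registers:

```java
// Illustrative only: the one-at-a-time vs 8x-unrolled loop shapes
// discussed above, using a trivial sum as the per-value work.
public class UnrolledDecodeSketch {
    // One value per iteration - the shape of a nextNumber()-style loop.
    public static long sumOneAtATime(long[] input) {
        long sum = 0;
        for (int i = 0; i < input.length; i++) {
            sum += input[i];
        }
        return sum;
    }

    // 8 values per iteration, with a scalar tail for the remainder.
    public static long sumUnrolled8(long[] input) {
        long sum = 0;
        int i = 0;
        for (; i + 8 <= input.length; i += 8) {
            sum += input[i]     + input[i + 1] + input[i + 2] + input[i + 3]
                 + input[i + 4] + input[i + 5] + input[i + 6] + input[i + 7];
        }
        for (; i < input.length; i++) {  // tail
            sum += input[i];
        }
        return sum;
    }
}
```

Both produce the same result; the unrolled body just amortizes the loop and branch overhead across 8 values, which is where the decode-rate win comes from.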

For the other two next() calls, this is the equivalent unrolling that was done by Prasanth
with Integers.


The Long -> Double conversion similarly has to do more register work, instead of calling nextNumber()
one value at a time. And I like SPLIT because it is very natural in its implementation & therefore
easier to parallelize than FPC.
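For the Long -> Double step, the bulk version is just a tight loop over a decoded batch - again an illustrative sketch with made-up names, using Double.longBitsToDouble (a JIT intrinsic), rather than the actual reader:

```java
// Sketch: convert a whole batch of decoded long bit patterns to doubles
// in one branch-free loop, instead of one nextNumber() call per value.
public class LongToDoubleSketch {
    public static void bitsToDoubles(long[] bits, double[] out, int n) {
        for (int i = 0; i < n; i++) {
            // longBitsToDouble is an intrinsic, so this loop stays in registers.
            out[i] = Double.longBitsToDouble(bits[i]);
        }
    }
}
```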

The current numbers aren't the final numbers by a long shot, but FLIP and SPLIT are the ones
where I feel like more work is useful.

