kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Hints about encoding and compression
Date Mon, 12 Dec 2016 06:02:21 GMT
Hi Nicolas,

Apologies for the slow response. Answers inline:

On Mon, Dec 5, 2016 at 8:21 PM, Nicolas Fouché <nfouche@onfocus.io> wrote:

> Hi,
> I'm evaluating Kudu and I'd need some hints about column encoding and
> compression.
> A- Does it make sense adding LZ4 compression to a field with Dictionary
> Encoding ?

This has a slightly complex answer. If the column has low cardinality, then
dictionary encoding stores the codeword blocks (i.e., the numeric indexes
into the dictionary) using bitshuffle encoding, which is inherently
LZ4-compressed. So, adding LZ4 on top will do nothing except add overhead.

The complexity comes in that the dictionary encoding implementation
automatically falls back to "PLAIN" if the cardinality is too high to
create an effective dictionary. In that case, the LZ4 compression would be
useful (just as it would on PLAIN).
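
To make the fall-back behavior concrete, here is a toy sketch in Python. It
is not Kudu's implementation: the function name and the max_dict_size
threshold are made up for illustration, and the real encoder makes this
decision internally with its own limits.

```python
def encode_column(values, max_dict_size=4):
    """Toy dictionary encoder that falls back to plain storage.

    max_dict_size is a hypothetical threshold, not Kudu's real limit.
    """
    dictionary = {}   # distinct value -> codeword (its index in the dictionary)
    codewords = []
    for v in values:
        if v not in dictionary:
            if len(dictionary) >= max_dict_size:
                # Cardinality too high: abandon the dictionary, store as-is.
                return ("PLAIN", list(values))
            dictionary[v] = len(dictionary)
        codewords.append(dictionary[v])
    # dicts keep insertion order, so position i holds the value for codeword i.
    return ("DICT", list(dictionary), codewords)


low = encode_column(["fr", "us", "fr", "fr", "de", "us"])
high = encode_column([f"user-{i}" for i in range(100)])
print(low[0], high[0])
```

In the low-cardinality case only the small integer codewords remain to be
stored, which is the part that bitshuffle handles. In the fall-back case the
raw values are stored unchanged, which is where LZ4 can still help.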

Given this, I'm hoping to work on a patch very soon which allows you to
specify LZ4 compression, and it will only take effect in the fall-back case.
See KUDU-1600 for more info on this.

> B- Does it make sense adding LZ4 compression to a field with Run-Length
> Encoding ?

Probably wouldn't help much. LZ4 only compresses repeated sequences, and
typically the only cross-row sequences you'd have in an integer column
would be runs, which are already well compressed by RLE. Something like
ZLIB compression (which does Huffman coding) would be effective on top of
RLE, but at a pretty high CPU cost.
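
As a rough illustration of why an entropy coder can still find something
after RLE, the sketch below run-length encodes a synthetic column and then
applies DEFLATE with match-finding turned off (zlib's Z_HUFFMAN_ONLY
strategy), so that only Huffman coding is at work. The data, the
one-byte-per-field run layout, and the helper names are assumptions made for
the example; Kudu's on-disk RLE format is different.

```python
import random
import zlib


def rle_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for v in values:
        # Cap runs at 255 so each length fits in a single byte below.
        if runs and runs[-1][0] == v and runs[-1][1] < 255:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def huffman_only(data):
    """DEFLATE with string matching disabled: Huffman coding only."""
    c = zlib.compressobj(level=9, strategy=zlib.Z_HUFFMAN_ONLY)
    return c.compress(data) + c.flush()


rng = random.Random(0)
column = []
for _ in range(2000):
    # Synthetic data: small values repeated in runs of 1 to 40 rows.
    column.extend([rng.randrange(8)] * rng.randint(1, 40))

runs = rle_encode(column)
rle_bytes = bytes(b for run in runs for b in run)  # 1 byte value, 1 byte length
raw_size = 4 * len(column)                         # plain INT32 storage

print("plain:", raw_size)
print("rle:", len(rle_bytes))
print("rle + huffman:", len(huffman_only(rle_bytes)))
```

The run values and run lengths use only a small part of the byte range, and
that skew is what Huffman coding exploits. Repeated-sequence matching alone
has little left to find once the runs are collapsed.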

> C- I have a non-key column with randomly distributed INT32 numbers, I
> guess I won't add an encoding. But what about compression ? Would LZ4 make
> sense ? Would it slow down aggregations (`SUM`) ?

If they're truly randomly distributed, then no compression or encoding will
be able to do much with them. If they're randomly distributed but tend to
be clustered together into a particular range within the whole INT32 domain
(e.g., something like timestamps), then BITSHUFFLE is probably a good bet.
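
The difference between the two cases can be sketched in Python. This is only
an approximation under stated assumptions: the bit transpose is a slow
pure-Python stand-in for the real bitshuffle library, zlib stands in for LZ4
(which is not in the standard library, and unlike LZ4 zlib also does Huffman
coding), and the "timestamp-like" data is synthetic. The sizes it prints are
not representative of what Kudu would produce.

```python
import random
import zlib


def bitshuffle(values, width=32):
    """Bit-level transpose: bit 0 of every value, then bit 1, and so on."""
    out = bytearray()
    for bit in range(width):
        plane = 0
        for v in values:
            plane = (plane << 1) | ((v >> bit) & 1)
        out += plane.to_bytes((len(values) + 7) // 8, "big")
    return bytes(out)


def compressed_size(values):
    """Size after the transpose plus a general-purpose compressor."""
    return len(zlib.compress(bitshuffle(values)))


rng = random.Random(0)
n = 4096
# Truly random 32-bit patterns: every bit plane looks like noise.
random_col = [rng.getrandbits(32) for _ in range(n)]
# Clustered values: a fixed base plus a small random offset, so the
# high-order bit planes are identical across all rows.
clustered_col = [1_481_000_000 + rng.randrange(100_000) for _ in range(n)]

raw_size = 4 * n
print("raw:", raw_size)
print("random:", compressed_size(random_col))
print("clustered:", compressed_size(clustered_col))
```

After the transpose, the constant high-order bits of the clustered column
become long runs of identical bytes that compress to almost nothing, while
the random column has no such structure in any bit plane.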

Todd Lipcon
Software Engineer, Cloudera
