orc-dev mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Orc v2 Ideas
Date Tue, 09 Oct 2018 20:04:39 GMT

> Zstd with particular settings doesn’t work well on one particular non-public dataset
> after it is encoded by RLE.
> I’ve suggested that you try tuning the zstd compression to find a set of parameters
> that work well with RLE. Take a look at how we tune the zlib compression based on the type
> of the stream and column.

We had a nearly identical discussion about Zlib when comparing it against SNAPPY - we don't
use the same Zlib variant for all columns.
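
As a rough illustration of what "not the same Zlib variant for all columns" can mean, here is a minimal Python sketch that picks a deflate strategy per stream kind. It is not ORC's actual code: the stream kinds "text" and "binary", the function names, and the exact level/strategy pairing are assumptions made for the example. Only the zlib constants and calls are real.

```python
import zlib

# Hypothetical stream categories; the names are illustrative, not ORC's own.
TEXT, BINARY = "text", "binary"

def make_compressor(stream_kind, favor_speed=True):
    """Return a raw-deflate compressor tuned for the given stream kind."""
    # Assumed pairing: fastest level when speed matters, zlib's default otherwise.
    level = zlib.Z_BEST_SPEED if favor_speed else zlib.Z_DEFAULT_COMPRESSION
    # Z_FILTERED targets small values with a somewhat random distribution
    # (e.g. already-encoded integers); the default strategy suits text-like bytes.
    strategy = zlib.Z_FILTERED if stream_kind == BINARY else zlib.Z_DEFAULT_STRATEGY
    # wbits=-15 selects raw deflate, with no zlib header or checksum.
    return zlib.compressobj(level, zlib.DEFLATED, -15, 8, strategy)

def compress(data, stream_kind, favor_speed=True):
    compressor = make_compressor(stream_kind, favor_speed)
    return compressor.compress(data) + compressor.flush()

def decompress(blob):
    # The same decoder handles every strategy; only the encoder side differs.
    return zlib.decompress(blob, -15)
```

The point of the sketch is that the strategy is purely a writer-side choice: `decompress` needs no knowledge of which variant produced the bytes.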

Zstd has similar variants that are well suited to different streams of data - for example,
the btlazy2 strategy.

Decompression performance was the biggest concern that came up in those discussions, so there
is a 2-flag combo (encoding.strategy and compression.strategy).

Both are set to SPEED right now, because that's what most people want out of ORC data - but
if the goals are different, then those flags should translate into Zstd strategies (the strategies
don't need to be recorded in the binaries, unlike dictionaries).
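
To make the idea of translating the two flags into Zstd strategies concrete, here is a small Python sketch of such a lookup. The strategy names come from zstd's ZSTD_strategy enum, but the mapping itself, the stream kinds, and the function name are hypothetical. Nothing here reflects a decision made in ORC, and a real mapping would need benchmarking.

```python
# Strategy names from zstd's ZSTD_strategy enum, ordered fastest to strongest.
ZSTD_STRATEGIES = (
    "fast", "dfast", "greedy", "lazy", "lazy2",
    "btlazy2", "btopt", "btultra", "btultra2",
)

# Hypothetical mapping from (compression.strategy, stream kind) to a Zstd
# strategy. These pairings are placeholders, not measured recommendations.
CHOICES = {
    ("SPEED", "text"): "fast",
    ("SPEED", "binary"): "dfast",
    ("COMPRESSION", "text"): "btlazy2",
    ("COMPRESSION", "binary"): "btopt",
}

def pick_zstd_strategy(compression_strategy, stream_kind):
    """Look up the Zstd strategy name for a flag value and stream kind."""
    key = (compression_strategy.upper(), stream_kind.lower())
    if key not in CHOICES:
        raise ValueError(f"no Zstd strategy configured for {key}")
    return CHOICES[key]
```

Because the strategy only affects how the encoder searches for matches, a table like this can live entirely in the writer and change between releases without touching the file format.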

An efficient literal representation for Zstd is definitely something to consider - I haven't
dug into it because I'm currently missing a tool like "infgen" for Zstd to walk through the hex.

