orc-dev mailing list archives

From "Xiening Dai" <xiening....@alibaba-inc.com>
Subject Re:Re: [Discussion] Base 128 variable integer encoding is not always good
Date Wed, 19 Sep 2018 01:19:02 GMT
I think the bigger issue here is the combination of zstd and LEB128, which results in a much
lower compression ratio than zlib. This is by design for zstd level 1. And according to the
answer from the zstd community (see the link from Gang), this only gets better at much higher
levels (say, 12).
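
For readers unfamiliar with the encoding under discussion, here is a minimal sketch of unsigned
LEB128 (base 128 varint) in Python. It is for illustration only and is not the ORC writer's
code; the function names are made up for this example.

```python
def leb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 data bits per byte)."""
    if value < 0:
        raise ValueError("unsigned LEB128 needs a non-negative integer")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def leb128_decode(data: bytes) -> int:
    """Decode a single unsigned LEB128 value from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated LEB128 value")
```

Small values take one byte, while a value near 2^63 takes 10 bytes instead of the 8 bytes a
fixed-width representation would use.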
I think fundamentally we should consider the compression codec as a factor when we choose the
encoding. Certain encoding mechanisms work better with some compression algorithms, and the
writer should pick the best encoder based on the compressor the user chooses.
Regarding this specific issue, we could just disable LEB128 when zstd is chosen. But this
would require introducing a new type of stream, something like direct_v1_no_leb. We probably
should extend the metadata to better represent these differences.
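
A rough sketch of what a codec-aware choice could look like, in Python. Everything here is
hypothetical: the function, the set of codecs, and the "direct_v1" placeholder are not part of
any ORC API, and "direct_v1_no_leb" is only the name floated above.

```python
# Hypothetical: codecs for which the writer would skip LEB128 in direct streams.
NO_LEB_CODECS = {"zstd"}


def pick_direct_encoding(codec: str) -> str:
    """Return a stream encoding name based on the compression codec the user chose."""
    if codec.lower() in NO_LEB_CODECS:
        return "direct_v1_no_leb"  # fixed-width integers, no varint
    return "direct_v1"             # placeholder name for the LEB128-based direct encoding
```

The reader would need the same information, which is why the chosen encoding has to be
recorded in the file metadata.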

from Alimail iPhone

------------------ Original Mail ------------------
Sender: Owen O'Malley <owen.omalley@gmail.com>
Send Date: Tue Sep 18 16:08:40 2018
Recipients: Gopal Vijayaraghavan <gopalv@apache.org>
CC: <dev@orc.apache.org>, Xiening Dai <xiening.dai@alibaba-inc.com>
Subject: Re: [Discussion] Base 128 variable integer encoding is not always good

Gang,

As you correctly
point out, some columns don't work well with RLE. Unfortunately, without being able to look
at the data it is hard for me to guess what the right compression strategies are. Based on
your description, I would guess that the data doesn't have a lot of patterns to it and covers
the majority of the 64 bit integer space. I think the best approach would be to make sure
that RLEv3 has a low overhead representation of literals. So a literal mode something like:
header: 2 bytes (literal, 512 values, size 64bit)
data: 512 * 8 bytes
So the overhead would be roughly 2/4096 ≈ 0.0005, or about 0.05%.
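
The literal mode described above can be sketched in Python as follows. The bit layout of the
2-byte header (1 literal flag bit, 9 bits for the count minus one, 6 bits for the bit width
minus one) and the big-endian byte order are assumptions made for this example; the message
only specifies a 2-byte header followed by 512 values of 8 bytes each.

```python
import struct

MAX_VALUES = 512   # values per literal block, as in the proposal
BIT_WIDTH = 64     # each value is stored as a full 64-bit integer


def encode_literal_block(values):
    """Pack 1..512 signed 64-bit integers behind a 2-byte header (assumed layout)."""
    count = len(values)
    if not 1 <= count <= MAX_VALUES:
        raise ValueError("a literal block holds between 1 and 512 values")
    # Assumed header layout: [1 bit literal flag][9 bits count-1][6 bits width-1]
    header = (1 << 15) | ((count - 1) << 6) | (BIT_WIDTH - 1)
    return struct.pack(">H", header) + struct.pack(f">{count}q", *values)


def header_overhead(count):
    """Header bytes divided by data bytes for a block of `count` values."""
    return 2 / (count * 8)
```

A full block is 2 + 4096 = 4098 bytes, so header_overhead(512) is 2/4096, about 0.05% of the
data size.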

On Tue, Sep 18, 2018 at 3:38 PM Gopal Vijayaraghavan <gopalv@apache.org> wrote:

>  From above observation, we find that it is better to disable LEB128 encoding while
>  zstd is used.

You can enable file size optimizations (automatically recommend better layouts for compression)


There are a bunch of bitpacking loops that are controlled by that flag already.

>     https://github.com/facebook/zstd/issues/1325.

If I understand that correctly, a DIRECT_V2 would also work fine for the numeric sequences
in Zstd instead?


