orc-dev mailing list archives

From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: [Discussion] Base 128 variable integer encoding is not always good
Date Tue, 18 Sep 2018 23:08:25 GMT
Gang,
   As you correctly point out, some columns don't work well with RLE.
Unfortunately, without being able to look at the data it is hard for me to
guess what the right compression strategies are. Based on your description,
I would guess that the data doesn't have a lot of patterns to it and covers
the majority of the 64 bit integer space. I think the best approach would
be to make sure that RLEv3 has a low overhead representation of literals.
So a literal mode something like:

header: 2 bytes (literal, 512 values, size 64bit)
data: 512 * 8 bytes

So the overhead would be roughly 2/4096 ≈ 0.0005, or about 0.05%.
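
To make the arithmetic concrete, here is a minimal Python sketch of such a literal run. RLEv3 has no defined format, so the header bit layout (2-bit mode tag, 5-bit width code, 9-bit length minus one), the tag values, and the big-endian byte order are all assumptions made for illustration. It also shows why base-128 varints lose on values that span the full 64-bit range: they need up to 10 bytes each instead of a fixed 8.

```python
import struct

RUN_LENGTH = 512    # values per literal run, as in the proposal above
VALUE_BYTES = 8     # fixed 64-bit width
HEADER_BYTES = 2

# Hypothetical tag values; nothing here is part of any ORC specification.
MODE_LITERAL = 0b11       # assumed 2-bit mode tag
WIDTH_CODE_64 = 0b11111   # assumed 5-bit code meaning "64 bits per value"


def encode_literal_run(values):
    """Encode up to 512 unsigned 64-bit values as a 2-byte header plus raw data."""
    n = len(values)
    if not 1 <= n <= RUN_LENGTH:
        raise ValueError("a literal run holds between 1 and 512 values")
    # 2 bits mode | 5 bits width code | 9 bits (length - 1) = 16 bits.
    header = (MODE_LITERAL << 14) | (WIDTH_CODE_64 << 9) | (n - 1)
    # Big-endian byte order is an assumption of this sketch.
    return struct.pack(">H", header) + struct.pack(f">{n}Q", *values)


def varint_len(u):
    """Bytes needed to store unsigned integer u as a base-128 varint (LEB128)."""
    n = 1
    while u >= 0x80:
        u >>= 7
        n += 1
    return n


values = [2**63 + i for i in range(RUN_LENGTH)]  # large values, no useful pattern
literal_size = len(encode_literal_run(values))
varint_size = sum(varint_len(v) for v in values)
overhead = HEADER_BYTES / (RUN_LENGTH * VALUE_BYTES)

print(f"literal run: {literal_size} bytes, header overhead {overhead:.4%}")
print(f"varint:      {varint_size} bytes")
```

For values this large the varint form costs 10 bytes per value, so the fixed-width literal run is about 20% smaller before any general-purpose compression is applied.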

Thoughts?

On Tue, Sep 18, 2018 at 3:38 PM Gopal Vijayaraghavan <gopalv@apache.org>
wrote:

> Hi,
>
> >  From above observation, we find that it is better to disable LEB128
> encoding while zstd is used.
>
> You can enable file size optimizations (automatically recommend better
> layouts for compression) when
>
> "orc.encoding.strategy"="COMPRESSION"
>
> There are a bunch of bitpacking loops that are already controlled by that
> flag.
>
> >     https://github.com/facebook/zstd/issues/1325.
>
> If I understand that correctly, a DIRECT_V2 would also work fine for the
> numeric sequences in Zstd instead?
>
> Cheers,
> Gopal
