orc-dev mailing list archives

From Gang Wu <ust...@gmail.com>
Subject Re: [Discussion] Base 128 variable integer encoding is not always good
Date Wed, 19 Sep 2018 03:48:19 GMT
Owen
  I have posted the example data that reproduces the issue at
https://github.com/facebook/zstd/issues/1325. It contains 512 unsigned
numbers that are already zigzag-encoded using (val << 1) ^ (val >> 63). The
low-overhead representation of literals is exactly what we need for RLEv3.
We should also note that zstd does not work well with LEB128, whereas zlib
gets a better compression ratio with LEB128. There is no one-size-fits-all
solution, and we may end up with several optimal combinations of encoding
and compression settings.
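
To make the size trade-off concrete, here is a minimal Python sketch of the
zigzag mapping (val << 1) ^ (val >> 63) and of the byte count that base-128
(LEB128) encoding needs per value, compared with a fixed-width literal run.
The function names are hypothetical and this is not ORC code; the 2-byte
header and the 512-value run are taken from Owen's proposal quoted below and
are assumptions about what an RLEv3 literal mode might look like.

```python
# Illustrative sketch only; names are hypothetical, not from the ORC code base.
MASK64 = (1 << 64) - 1


def zigzag64(val: int) -> int:
    """Map a signed 64-bit integer to unsigned: (val << 1) ^ (val >> 63)."""
    # Python's >> on a negative int is an arithmetic shift, as the formula needs.
    return ((val << 1) ^ (val >> 63)) & MASK64


def leb128_len(u: int) -> int:
    """Bytes needed to store unsigned u with 7 payload bits per byte."""
    n = 1
    while u >= 0x80:
        u >>= 7
        n += 1
    return n


def leb128_total(values) -> int:
    """Total bytes for zigzag + LEB128 over a sequence of signed values."""
    return sum(leb128_len(zigzag64(v)) for v in values)


def fixed_literal_total(count: int, header_bytes: int = 2) -> int:
    """Assumed literal run: a small header plus 8 bytes per value."""
    return header_bytes + 8 * count
```

Any zigzag-encoded value of 2^56 or more needs at least 9 bytes in LEB128,
and one of 2^63 or more needs 10, so data spread across most of the 64-bit
space comes out larger than the 8 bytes per value of a plain literal run,
before any general-purpose compressor sees it.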

Gopal
  DIRECT_V2 is RLEv2, which can alleviate the issue but does not resolve
it. I will take a look at the orc.encoding.strategy setting.

Thanks!
Gang

On Tue, Sep 18, 2018 at 4:08 PM Owen O'Malley <owen.omalley@gmail.com>
wrote:

> Gang,
>    As you correctly point out, some columns don't work well with RLE.
> Unfortunately, without being able to look at the data it is hard for me to
> guess what the right compression strategies are. Based on your description,
> I would guess that the data doesn't have a lot of patterns to it and covers
> the majority of the 64 bit integer space. I think the best approach would
> be to make sure that RLEv3 has a low overhead representation of literals.
> So a literal mode something like:
>
> header: 2 bytes (literal, 512 values, size 64bit)
> data: 512 * 8 bytes
>
> So the overhead would be roughly 2/4096, or about 0.05%.
>
> Thoughts?
>
> On Tue, Sep 18, 2018 at 3:38 PM Gopal Vijayaraghavan <gopalv@apache.org>
> wrote:
>
> > Hi,
> >
> > >  From above observation, we find that it is better to disable LEB128
> > encoding while zstd is used.
> >
> > You can enable file size optimizations (automatically recommend better
> > layouts for compression) when
> >
> > "orc.encoding.strategy"="COMPRESSION"
> >
> > There are a bunch of bitpacking loops that are controlled by that flag
> > already.
> >
> > >     https://github.com/facebook/zstd/issues/1325.
> >
> > If I understand that correctly, a DIRECT_V2 would also work fine for the
> > numeric sequences in Zstd instead?
> >
> > Cheers,
> > Gopal
> >
> >
> >
> >
>
