orc-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Owen O'Malley" <owen.omal...@gmail.com>
Subject Re: [Discussion] Base 128 variable integer encoding is not always good
Date Wed, 19 Sep 2018 22:25:23 GMT
Thanks for the sample data.

Just out of curiosity, is the natural data actually sorted like that?

I think you have a misunderstanding of RLEv2. It doesn't use LEB128 except
for the values in the header. What does RLEv1 + fixed 8 byte mean?

Based on the 512 values that you posted, I see:

512 values, min = 16430, max = 2403786, minDelta = 6, maxDelta = 78612
order = incr, bitLen = 22, deltaBitLen = 17

so the RLEv2 should have used the delta encoding. RLEv2 should have used 24
bit for each of the values in the encoding. Although with the bitLen and
deltaBitLen both between 16 and 24 bits, the delta encoding doesn't help
much. Anyways looking at what those 512 sample numbers will look like in
RLEv2:

header: 2 bytes
base: 3 bytes
firstDelta: 3 bytes
rest: 510 * 3 bytes
total: 1538 bytes

compared the direct encoding of 512 * 8 bytes = 4096 bytes. The RLEv2 is at
38% of the direct 8 byte encoding. Even if the data wasn't sorted, it
should end up with patched base with a similar size (~3 bytes/value).

Part of the reason that we don't use the odd bit sizes that are defined for
RLEv2 was precisely because zlib didn't compress well with the non-byte
aligned data. Have you tried extending the java/benchmarks with zstd to see
what happens with other data sets? I guess you could add an ORC0 option to
the benchmark to compare RLEv1 to RLEv2 under each of the compression
codecs.

This is a great conversation and I think it will be great for the RLEv3 and
ORC 2 format. In ORC 1, I don't think we should use the compression codec
as a factor to disable or change the RLE, especially based on a single data
set. I'd be more tempted to use zstd with the level set high enough that it
is useful on RLE data, since that doesn't break any old readers.

.. Owen

On Tue, Sep 18, 2018 at 8:48 PM Gang Wu <ustcwg@gmail.com> wrote:

> Owen
>   I have put the example data to reproduce the issue in
> https://github.com/facebook/zstd/issues/1325. It contains 512 unsigned
> numbers which are already zigzag-encoded using (val « 1) ^ (val » 63). The
> low overhead representation of literals is exactly what we need for RLEv3.
> We should also pay attention that zstd does not work well with LEB128 but
> zlib can get better compression ratio with LEB128. There is no one-for-all
> solution and we may come up with several optimal combinations of encoding
> and compression settings.
>
> Gopal
>   DIRECT_V2 is RLEv2 which can alleviate the issue but not resolve it. I
> will take a look at the orc.encoding.strategy setting.
>
> Thanks!
> Gang
>
> On Tue, Sep 18, 2018 at 4:08 PM Owen O'Malley <owen.omalley@gmail.com>
> wrote:
>
> > Gang,
> >    As you correctly point out, some columns don't work well with RLE.
> > Unfortunately, without being able to look at the data it is hard for me
> to
> > guess what the right compression strategies are. Based on your
> description,
> > I would guess that the data doesn't have a lot of patterns to it and
> covers
> > the majority of the 64 bit integer space. I think the best approach would
> > be to make sure that RLEv3 has a low overhead representation of literals.
> > So a literal mode something like:
> >
> > header: 2 bytes (literal, 512 values, size 64bit)
> > data: 512 * 8 bytes
> >
> > So the overhead would be roughly 2/4096 = 0.005.
> >
> > Thoughts?
> >
> > On Tue, Sep 18, 2018 at 3:38 PM Gopal Vijayaraghavan <gopalv@apache.org>
> > wrote:
> >
> > > Hi,
> > >
> > > >  From above observation, we find that it is better to disable LEB128
> > > encoding while zstd is used.
> > >
> > > You can enable file size optimizations (automatically recommend better
> > > layouts for compression) when
> > >
> > > "orc.encoding.strategy"="COMPRESSION"
> > >
> > > There are a bunch of bitpacking loops that's controlled by that flag
> > > already.
> > >
> > > >     https://github.com/facebook/zstd/issues/1325.
> > >
> > > If I understand that correctly, a DIRECT_V2 would also work fine for
> the
> > > numeric sequences in Zstd instead?
> > >
> > > Cheers,
> > > Gopal
> > >
> > >
> > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message