orc-dev mailing list archives

From Gang Wu <gan...@apache.org>
Subject [Discussion] Base 128 variable integer encoding is not always good
Date Tue, 18 Sep 2018 20:41:21 GMT

We are using zstd as the default compressor for ORC in production. Overall
the performance is very good, but our analysis shows there is still room for
improvement with integers.

As we know, all integers use base 128 varint encoding (a.k.a. LEB128) after
RLE. This works well for zlib and other compressors. However, when we use
zstd, LEB128-encoded data compresses worse than fixed-width 64-bit (int64_t)
data. I have opened an issue with the zstd community and they confirmed this:
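
For readers less familiar with the two layouts being compared, here is a
minimal sketch of them, not ORC's actual writer code. The function names are
made up for illustration, only non-negative values are handled, and the
little-endian byte order of the fixed-width form is an assumption.

```python
import struct


def leb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (base 128 varint)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)         # high bit clear: last byte
            return bytes(out)


def fixed64_encode(value: int) -> bytes:
    """Encode a non-negative integer as 8 bytes (byte order is an assumption)."""
    return struct.pack("<Q", value)
```

With LEB128 the byte length depends on the magnitude of each value, so
consecutive integers do not start at regular offsets; with the fixed-width
form every value occupies exactly 8 bytes.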

To provide some data, we have an ORC file with 10 columns (4 long columns and
6 string columns). None of the 4 long columns is a good fit for RLE, meaning
that most of their values end up as literals in the RLE output. The overall
file sizes under different settings are as follows:

   - RLEv1 + LEB128: 8991617 bytes
   - RLEv2 + LEB128: 8305585 bytes
   - RLEv1 + fixed 64-bit: 7961360 bytes

I then analyzed a single column of the file and got the following results:

   - RLEv1 + zstd + LEB128: 1188651 bytes
   - RLEv1 + zstd + fixed 64-bit: 685522 bytes
   - RLEv1 + zlib + LEB128: 834729 bytes
   - RLEv1 + zlib + fixed 64-bit: 854529 bytes
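
A comparison of this kind can be sketched roughly as follows. This is only
an illustration under stated assumptions: the input is synthetic random data
rather than the column measured above, RLE is skipped entirely, and only
zlib is used because it ships with Python (zstd would need a third-party
binding). The helper names are hypothetical.

```python
import random
import struct
import zlib


def leb128_encode(value: int) -> bytes:
    """Unsigned LEB128: 7 payload bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)
        else:
            out.append(low7)
            return bytes(out)


def compressed_sizes(values):
    """Return raw and zlib-compressed sizes for both integer layouts."""
    varint_stream = b"".join(leb128_encode(v) for v in values)
    # Little-endian fixed 64-bit layout; the byte order is an assumption.
    fixed_stream = struct.pack(f"<{len(values)}Q", *values)
    return {
        "leb128_raw": len(varint_stream),
        "fixed64_raw": len(fixed_stream),
        "leb128_zlib": len(zlib.compress(varint_stream)),
        "fixed64_zlib": len(zlib.compress(fixed_stream)),
    }


# Synthetic stand-in data (seeded for repeatability), not the real column.
rng = random.Random(0)
sample = [rng.randrange(1 << 40) for _ in range(10_000)]
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

The numbers printed for synthetic data should not be expected to match the
figures above; the point is only the shape of the measurement.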

From the above observations, we find that it is better to disable LEB128
encoding when zstd is used. This can be easily achieved by bumping the
file version. Any thoughts?

