hive-user mailing list archives

From Gopal Vijayaraghavan <gop...@apache.org>
Subject Re: Why does ORC use Deflater instead of native ZlibCompressor?
Date Fri, 24 Jun 2016 02:04:41 GMT
> Though, I'm also wondering about the performance difference between
> the two. Since they both use native implementations, theoretically they
> can be close in performance.

ZlibCompressor block compression was extremely slow due to the non-JNI
bits in Hadoop - <https://issues.apache.org/jira/browse/HADOOP-10681>
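For reference, the JDK side of this is java.util.zip.Deflater, itself a thin JNI wrapper over zlib. A minimal sketch of block-compressing one buffer with it (ORC's codec additionally enables the raw-deflate "nowrap" mode; the buffer sizing here is a simplification for illustration, not ORC's actual codec code):

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflaterDemo {
    // Compress one buffer with java.util.zip.Deflater, the JDK's JNI
    // wrapper around zlib. (ORC's codec also uses raw deflate / "nowrap";
    // omitted here to keep the round-trip simple.)
    static byte[] deflate(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length * 2 + 64]; // ample for this demo
        int n = d.deflate(buf);
        d.end();
        byte[] out = new byte[n];
        System.arraycopy(buf, 0, out, 0, n);
        return out;
    }

    static byte[] inflate(byte[] input, int originalLen) throws DataFormatException {
        Inflater inf = new Inflater();
        inf.setInput(input);
        byte[] out = new byte[originalLen];
        inf.inflate(out);
        inf.end();
        return out;
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] data = "hello hello hello hello".getBytes();
        byte[] compressed = deflate(data);
        byte[] roundTrip = inflate(compressed, data.length);
        System.out.println(new String(roundTrip).equals("hello hello hello hello"));
    }
}
```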

When I last benchmarked after that issue was fixed, 86% of CPU samples
were spent inside zlib.so in the perf traces - irrespective of which mode
was used.

The results of those profiles went into making ORC fit Zlib better and
avoiding doing compression work twice - ORC already did its own versions
of dictionary+RLE+bit-packing.

<http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller-49481231/22>

For instance, bit-packing data in the 0-127 range down into 7 bits and
then compressing it offered less compression (& cost more CPU) than
leaving it byte-aligned at 8 bits. LZ77 matching worked much better on
byte-aligned data, and the Huffman stage compressed it down by
bit-packing anyway. The impact was more visible at higher bit-counts
(27 bits is way worse than 32 bits).
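That observation can be sketched with java.util.zip.Deflater: pack 7-bit values tightly, keep another copy byte-aligned, and deflate both. The packing routine and the synthetic column data below are my own simplified stand-ins, not ORC's actual encoder; real ratios depend on the data.

```java
import java.util.zip.Deflater;

public class PackVsByteAligned {
    // Pack values < 128 into 7 bits each, MSB-first: a simplified
    // stand-in for ORC-style bit-packing, not ORC's actual code.
    static byte[] pack7(int[] vals) {
        byte[] out = new byte[(vals.length * 7 + 7) / 8];
        int bitPos = 0;
        for (int v : vals) {
            for (int b = 6; b >= 0; b--) {
                if (((v >> b) & 1) != 0) {
                    out[bitPos / 8] |= (byte) (1 << (7 - bitPos % 8));
                }
                bitPos++;
            }
        }
        return out;
    }

    static int deflatedSize(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 1024];
        int n = d.deflate(buf);
        d.end();
        return n;
    }

    public static void main(String[] args) {
        // A mildly repetitive 7-bit sequence, standing in for column data.
        int[] vals = new int[10000];
        for (int i = 0; i < vals.length; i++) vals[i] = (i % 50) + 10;

        byte[] aligned = new byte[vals.length];
        for (int i = 0; i < vals.length; i++) aligned[i] = (byte) vals[i];

        int packedZ = deflatedSize(pack7(vals));
        int alignedZ = deflatedSize(aligned);
        System.out.println("packed+deflate=" + packedZ
                + " byte-aligned+deflate=" + alignedZ);
    }
}
```

Bit-packing first misaligns the repeats, so deflate's byte-oriented LZ77 matcher sees longer effective periods in the packed stream.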

And then there was turning off the bits of Zlib that aren't necessary
for some encoding patterns - Z_FILTERED for the numeric sequences,
Z_TEXT for the string dicts, etc.
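In java.util.zip terms, the strategy knob is Deflater.setStrategy(Deflater.FILTERED), which maps to zlib's Z_FILTERED; the Z_TEXT data_type hint has no direct JDK equivalent. A sketch comparing strategies on small-delta numeric data (the input shape and sizes are invented for illustration):

```java
import java.util.Random;
import java.util.zip.Deflater;

public class StrategyDemo {
    static int deflatedSize(byte[] input, int strategy) {
        Deflater d = new Deflater();
        d.setStrategy(strategy); // DEFAULT_STRATEGY, FILTERED, or HUFFMAN_ONLY
        d.setInput(input);
        d.finish();
        byte[] buf = new byte[input.length + 1024];
        int n = d.deflate(buf);
        d.end();
        return n;
    }

    public static void main(String[] args) {
        // Small random deltas: the "mostly small values, few long matches"
        // input shape that zlib documents Z_FILTERED for.
        Random r = new Random(42);
        byte[] deltas = new byte[65536];
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] = (byte) (r.nextInt(7) - 3);
        }

        int def = deflatedSize(deltas, Deflater.DEFAULT_STRATEGY);
        int filt = deflatedSize(deltas, Deflater.FILTERED);
        System.out.println("default=" + def + " filtered=" + filt);
    }
}
```

FILTERED biases deflate toward Huffman coding over string matching, which is the right trade when matches in the data are mostly spurious.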

Purely from a performance standpoint, I'm getting more interested in Zstd,
because it brings a whole new way of fast bit-packing.

<https://issues.apache.org/jira/browse/ORC-45>


Cheers,
Gopal


