hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13849) Bzip2 java-builtin and system-native have almost the same compress speed
Date Thu, 01 Dec 2016 12:22:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15711831#comment-15711831 ]

Steve Loughran commented on HADOOP-13849:

Well, if you want to work on it, feel free. 

However, note that the native codec uses the standard {{libbz2}}; there's not much that can
be done in the Hadoop code to speed that up, other than improving how data is moved
between the Java memory structures and those of libbz2. If there are memory copies taking
place, then that could be hurting performance. Anything that can help there would be good.

bq. I think the "system native" should have better compress/decompress performance than "java

That's something to explore. The latest Java 8 compilers are fast, and if the algorithms aren't
doing lots of object creation, then bit operations in Java should be on a par with C-language
actions against general registers. Where you would expect differences is if the native code
uses special CPU registers and operations (for example, Intel SSE2) for a significant speedup.
I don't know if bzip2 does that.

The fun part in benchmarking is isolating things. For codec performance, maybe pre-generate some test
data and cache it in RAM, in standard formats (Avro, ORC), then compress it with
the different codecs to RAM, not HDD, so that the compression code is
isolated from disk IO, etc.

If the isolated native code is faster than the Java one, then the implication is that the
bottleneck is elsewhere in the workflow, not the codec. Again: that's interesting information.
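
A minimal sketch of that kind of isolation, using only the JDK: the input is generated in memory, compressed into a {{ByteArrayOutputStream}}, and timed after a few warm-up rounds so the JIT has had a chance to compile the hot loops. The {{StreamFactory}} interface, the class name, and the method names are hypothetical, and {{GZIPOutputStream}} is only a stand-in so the example runs without Hadoop on the classpath. To compare the bzip2 implementations, one would presumably plug in a factory that wraps the sink with the Hadoop codec's output stream instead, and use real Avro/ORC data rather than this synthetic text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

final class InMemoryCodecBench {

    /** Hypothetical hook: wraps a raw sink in a compressing stream. */
    interface StreamFactory {
        OutputStream wrap(OutputStream sink) throws IOException;
    }

    /** Generates compressible pseudo-text entirely in RAM (no disk reads). */
    static byte[] generateInput(int size, long seed) {
        Random rnd = new Random(seed);
        byte[] data = new byte[size];
        for (int i = 0; i < size; i++) {
            // Small alphabet, so the data is compressible like text.
            data[i] = (byte) ('a' + rnd.nextInt(8));
        }
        return data;
    }

    /**
     * Compresses the input into an in-memory sink.
     * Returns {compressedSizeBytes, elapsedNanos}.
     */
    static long[] timeCompress(StreamFactory factory, byte[] input) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream(input.length / 2);
        long start = System.nanoTime();
        try (OutputStream out = factory.wrap(sink)) {
            out.write(input);
        } // close() flushes the codec's trailer into the sink
        long elapsed = System.nanoTime() - start;
        return new long[] { sink.size(), elapsed };
    }

    public static void main(String[] args) throws IOException {
        byte[] input = generateInput(8 * 1024 * 1024, 42L);
        StreamFactory gzip = GZIPOutputStream::new; // stand-in codec

        // Warm-up rounds: results are discarded.
        for (int i = 0; i < 3; i++) {
            timeCompress(gzip, input);
        }

        long[] result = timeCompress(gzip, input);
        System.out.printf("in=%d bytes, out=%d bytes, time=%d ms%n",
                input.length, result[0], result[1] / 1_000_000);
    }
}
```

Running the same harness once per codec, on the same in-memory input, gives numbers that reflect the codec alone; any gap between those numbers and the end-to-end job times then points at the rest of the workflow.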

bq. My hardware CPU/Memory/Network bandwidth/Disk bandwidth are not bottleneck

One of them is. Always. It can be things like CPU cache latencies or excess synchronization
in the code; even branch misprediction in the CPU can hurt efficiency. FWIW, flame graphs are
currently the tool of choice for visualising performance during microbenchmarks.

> Bzip2 java-builtin and system-native have almost the same compress speed
> ------------------------------------------------------------------------
>                 Key: HADOOP-13849
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13849
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 2.6.0
>         Environment: os version: redhat6
> hadoop version: 2.6.0
> native bzip2 version: bzip2-devel-1.0.5-7.el6_0.x86_64
>            Reporter: Tao Li
> I tested bzip2 java-builtin and system-native compression, and I found the compress speed is almost the same. (I think the system-native should have better compress speed than java-builtin)
> My test case:
> 1. input file: 2.7GB text file without compression
> 2. after bzip2 java-builtin compress: 457MB, 12min 4sec
> 3. after bzip2 system-native compress: 457MB, 12min 19sec
> My MapReduce Config:
> conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false");
> conf.set("mapreduce.output.fileoutputformat.compress", "true");
> conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
> conf.set("mapreduce.output.fileoutputformat.compress.codec", "org.apache.hadoop.io.compress.BZip2Codec");
> conf.set("io.compression.codec.bzip2.library", "java-builtin"); // for java-builtin
> conf.set("io.compression.codec.bzip2.library", "system-native"); // for system-native
> And I am sure I have enabled the bzip2 native library; the output of the command "hadoop checknative -a" is as follows:
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:    true /lib64/libz.so.1
> snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
> lz4:     true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: true /usr/lib64/libcrypto.so
