hadoop-common-dev mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator
Date Thu, 18 Jun 2009 00:04:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720996#action_12720996 ]

Todd Lipcon commented on HADOOP-5598:

Scott: I just tried your version and was unable to get the same performance improvements.
I think we've established that the pure Java implementation definitely wins on small blocks.
For large blocks, I'm seeing the following on my laptop (64-bit, with a 64-bit JRE):

My most recent non-evil pure-Java: 250M/sec
Scott's patch that unrolls the loop: 260-280M/sec
Sun Java 1.6 update 14: 333M/sec
OpenJDK 1.6: 795M/sec
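For anyone who wants to reproduce numbers like these, a minimal throughput harness looks roughly like the following. This is an illustrative sketch, not the attached TestCrc32Performance.java; the class and method names are my own, and a real benchmark would want more warmup iterations and multiple runs.

```java
import java.util.Random;
import java.util.zip.CRC32;

public class Crc32Bench {
    // Returns throughput in MB/sec for checksumming `iters` copies of `buf`.
    static double throughput(byte[] buf, int iters) {
        CRC32 crc = new CRC32();
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            crc.reset();
            crc.update(buf, 0, buf.length);
        }
        long elapsed = System.nanoTime() - start;
        return (double) buf.length * iters * 1e9 / elapsed / (1 << 20);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[64 * 1024];   // one "large block"
        new Random(42).nextBytes(buf);
        throughput(buf, 1024);              // warm up the JIT before timing
        System.out.printf("java.util.zip.CRC32: %.0f MB/sec%n",
                          throughput(buf, 4096));
    }
}
```

Swapping in a different checksum implementation for the inner loop gives directly comparable M/sec figures.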

The OpenJDK implementation is simply wrapping zlib's crc32 routine, which must be highly optimized.
Given that we already have a JNI library for native compression using zlib, I'd like to simply
add a stub to libhadoop that wraps zlib's crc32. That should give us the same ~800M/sec throughput
for large blocks. Since we can implement the stub ourselves, we also have the ability to switch
to pure Java for small sizes and get the ~20x speedup there, with no adversarial workloads that
cause bad performance. On systems where the native code isn't available, we can simply use the
pure Java implementation for all sizes, since at worst it's only slightly slower than
java.util.zip.CRC32 and at best it's 30x faster.
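For context, the table-driven approach a pure-Java CRC32 typically uses looks roughly like the sketch below. This is illustrative only (the class name is mine, not the attached PureJavaCrc32.java); it computes the same reflected CRC-32 (zlib polynomial) as java.util.zip.CRC32, which makes the two directly comparable in a benchmark.

```java
import java.util.zip.CRC32;

public final class PureJavaCrc32Sketch {
    // Standard reflected CRC-32 table for the zlib polynomial 0xEDB88320.
    private static final int[] TABLE = new int[256];
    static {
        for (int n = 0; n < 256; n++) {
            int c = n;
            for (int k = 0; k < 8; k++) {
                c = (c & 1) != 0 ? 0xEDB88320 ^ (c >>> 1) : c >>> 1;
            }
            TABLE[n] = c;
        }
    }

    public static int crc32(byte[] b, int off, int len) {
        int crc = 0xFFFFFFFF;                          // initial value
        for (int i = off; i < off + len; i++) {
            crc = TABLE[(crc ^ b[i]) & 0xFF] ^ (crc >>> 8);
        }
        return ~crc;                                   // final XOR
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes();
        CRC32 ref = new CRC32();
        ref.update(data, 0, data.length);
        // Both lines print cbf43926, the standard CRC-32 check value.
        System.out.println(Integer.toHexString(crc32(data, 0, data.length)));
        System.out.println(Long.toHexString(ref.getValue()));
    }
}
```

The unrolling in Scott's patch processes several table lookups per loop iteration over a body like this one; the hybrid approach would dispatch between this kind of loop and the JNI zlib stub based on buffer length.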

I imagine that most production systems are using libhadoop, or at least could easily get it
deployed if it were shown to have significant performance benefits.

I'll upload a patch later this evening for this.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt,
> hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time
> in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total
> of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
