hadoop-common-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-6166) Improve PureJavaCrc32
Date Fri, 14 Aug 2009 21:05:14 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-6166:

    Attachment: Rplots-nehalem64.pdf

Looks like the benchmark has run long enough to get good data. Here are the benchmarks from
TestPureJavaCrc32 run on three different test systems. nehalem32 is the same nehalem box (3MB
L2 cache) running a 32-bit JVM. nehalem64 is that box with a 64-bit JVM. "laptop" is my MacBook
Pro (Core 2 duo) running a 64-bit JVM.

Each PDF has several pages:
  - The first graph shows performance over the whole byte range tested. You'll definitely
have to zoom in to be able to see anything here, and even then it's not that useful.
  - The remaining graphs show the different algorithms' performance on different sizes (same
as the tables people have been pasting into JIRA)

I ran the whole benchmark suite 50+ times to generate the error bars. Hopefully they'll serve
as a good visual indicator for where the differences are actually statistically significant.
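For anyone who wants to reproduce this locally without the Hadoop test harness, here's a rough sketch of the repeated-trial approach. It uses the JDK's built-in java.util.zip.CRC32 as a stand-in for PureJavaCrc32, and the class name, rep counts, and trial count are illustrative, not taken from the actual TestPureJavaCrc32:

```java
import java.util.zip.CRC32;

public class CrcBench {
    // Time `reps` checksum passes over `data`, returning throughput in MB/s.
    static double measure(byte[] data, int reps) {
        CRC32 crc = new CRC32();
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            crc.reset();
            crc.update(data, 0, data.length);
        }
        long elapsed = System.nanoTime() - start;
        return (double) data.length * reps / elapsed * 1e9 / (1 << 20);
    }

    public static void main(String[] args) {
        byte[] data = new byte[512];     // DFS-chunk-like size
        measure(data, 20000);            // warm-up pass so the JIT compiles first

        // Repeat the whole measurement many times and report mean +/- stddev,
        // which is what the error bars in the PDFs summarize.
        int trials = 50;
        double[] mbps = new double[trials];
        double sum = 0;
        for (int t = 0; t < trials; t++) {
            mbps[t] = measure(data, 20000);
            sum += mbps[t];
        }
        double mean = sum / trials, var = 0;
        for (double v : mbps) var += (v - mean) * (v - mean);
        double sd = Math.sqrt(var / (trials - 1));
        System.out.printf("512B: %.1f +/- %.1f MB/s%n", mean, sd);
    }
}
```

Overlapping error bars between two variants mean the difference is within run-to-run noise and shouldn't be read as a win.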

In summary, here's how I interpret the data:

- For the 4-byte case, PureJavaCrc32 wins out on my laptop and the 32-bit JVM by a strong
margin. On the 64-bit JVM it's within 5-10% of the rest (very little difference)
- The 8-byte case is interesting: all of the 16_16* CRCs perform worse than the _8_8 CRCs.
On the 32-bit JDK it's especially obvious (nearly a factor of two)
- The 512-byte case (probably most common for DFS): everyone is pretty much neck and neck.
The 8_8d implementation wins significantly on nehalem64, and 8_8b wins significantly on
nehalem32. On my laptop they're all within the error bars except for 16_16, which is significantly
slower.
- The 16MB case is the same as the 512-byte case, just more pronounced. 8_8d wins on 64-bit,
8_8b on 32-bit, both by about 10%.

So, I think the next step here is to profile a couple of MR applications to see what sizes
are most common.

My personal opinion is that we should target the 64-bit Nehalem architecture and the 128-byte
size range. This would point to the 8_8d implementation as the winner.
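For anyone skimming the attachments, a quick note on what the x_y variants are doing: the inner loop reads several bytes per iteration and looks each one up in its own precomputed table (the "slicing" technique), trading table-cache footprint for fewer loop iterations. Below is a minimal slicing-by-4 sketch of the same idea; it is simplified for illustration (fewer tables, no unrolling) and is not the code from any of the attached patches. It's checked against java.util.zip.CRC32:

```java
import java.util.zip.CRC32;

public class SlicedCrc32 {
    private static final int[][] T = new int[4][256];
    static {
        // T[0] is the standard reflected CRC-32 table (polynomial 0xEDB88320).
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int j = 0; j < 8; j++) {
                c = (c & 1) != 0 ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
            }
            T[0][i] = c;
        }
        // T[k] advances a byte k extra positions, letting the loop eat 4 bytes at once.
        for (int k = 1; k < 4; k++) {
            for (int i = 0; i < 256; i++) {
                T[k][i] = (T[k - 1][i] >>> 8) ^ T[0][T[k - 1][i] & 0xFF];
            }
        }
    }

    static int update(int crc, byte[] b, int off, int len) {
        crc = ~crc;
        while (len >= 4) {
            // XOR in 4 input bytes (little-endian), then combine the 4 table lookups.
            crc ^= (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8)
                 | ((b[off + 2] & 0xFF) << 16) | ((b[off + 3] & 0xFF) << 24);
            crc = T[3][crc & 0xFF] ^ T[2][(crc >>> 8) & 0xFF]
                ^ T[1][(crc >>> 16) & 0xFF] ^ T[0][crc >>> 24];
            off += 4;
            len -= 4;
        }
        // Remaining tail bytes, one at a time.
        while (len-- > 0) {
            crc = (crc >>> 8) ^ T[0][(crc ^ b[off++]) & 0xFF];
        }
        return ~crc;
    }

    public static void main(String[] args) {
        byte[] data = new byte[512];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        CRC32 ref = new CRC32();
        ref.update(data, 0, data.length);
        System.out.println(((int) ref.getValue() == update(0, data, 0, data.length))
                ? "match" : "mismatch");
    }
}
```

Widening the slice (more tables, more bytes per iteration) cuts loop overhead but grows the working set, which is why the wider variants can lose on small inputs and on cache-constrained configurations.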

> Improve PureJavaCrc32
> ---------------------
>                 Key: HADOOP-6166
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6166
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6166_20090722.patch, c6166_20090722_benchmark_32VM.txt, c6166_20090722_benchmark_64VM.txt,
> c6166_20090727.patch, c6166_20090728.patch, c6166_20090810.patch, c6166_20090811.patch, graph.r,
> graph.r, Rplots-laptop.pdf, Rplots-nehalem32.pdf, Rplots-nehalem64.pdf, Rplots.pdf, Rplots.pdf,
> Got some ideas to improve CRC32 calculation.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
