hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Carey (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-297) Implement a pure Java CRC32 calculator
Date Wed, 08 Jul 2009 18:57:14 GMT

    [ https://issues.apache.org/jira/browse/HDFS-297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728849#action_12728849
] 

Scott Carey commented on HDFS-297:
----------------------------------

{quote}Is this new code written from scratch or has it been taken from some Apache project
(just making sure that there aren't any GPL/LGPL issues){quote}

All the versions I wrote were from scratch, modifications from Todd's first version posted.
 That one uses a slight modification of the ubiquitous one-byte-at-a-time CRC32 algorithm.
 There are at lest a few dozen variants of this algorithm available in any license you wish,
including public domain.

Less common 4 and 8 byte at a time algorithms are out there.  The one that inspired me was
Intel's "slicing-by-8" (BSD license) C implementation here http://sourceforge.net/projects/slicing-by-8
 From this IEEE paper (That I did not read, I don't have access to the publication) http://www2.computer.org/portal/web/csdl/doi/10.1109/TC.2008.85
I did not copy any of that code (or build or execute it), it was conceptual reference however.

Doing some searching on google today:
I can't find a single implementation of a four-bytes-at-a-time approach in java.
The four bytes at a time approach is not new, you can find it elsewhere, coded in C:
Sun (CDDL) 
http://www.google.com/codesearch/p?hl=en&sa=N&cd=7&ct=rc#KSEQAa6EaaE/net-virt-src-20070216/usr/src/uts/common/inet/sctp_crc32.c&q=file:crc32%20four%20bytes
A common zlib variant (at least 15 copies of this in the first few pages of 'file:crc32' search:
http://www.google.com/codesearch/p?hl=en&sa=N&cd=9&ct=rc#5Y_n4SyFm44/digest/src/crc32.c&q=file:crc32%20four%20bytes

I also just now found an implementation that uses the same same seed I did (but one byte at
a time, public domain with copyright on the source).
http://c.snippets.org/snip_lister.php?fname=crc_32.c

So, all things considered above I don't think there are intellectual property issues.  The
idea came from a description of a whitepaper (but not the actual whitepaper) from Intel that
has example BSD licensed source code in C where the actual algorithm is significantly different
from mine, but demonstrates working more than one byte at a time with multiple lookup tables.
 Four bytes at a time algorithms exist in C code under CDDL and zlib, but differ a lot from
this one.  None of the above source was copied. 
Furthermore, there is no java code like it that I can find anywhere.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HDFS-297
>                 URL: https://issues.apache.org/jira/browse/HDFS-297
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt,
hadoop-5598.txt, hadoop-5598.txt, hdfs-297.txt, PureJavaCrc32.java, PureJavaCrc32.java, PureJavaCrc32.java,
TestCrc32Performance.java, TestCrc32Performance.java, TestCrc32Performance.java, TestPureJavaCrc32.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time
in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total
of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message