hadoop-common-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8239) Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used
Date Mon, 20 Aug 2012 16:19:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437980#comment-13437980 ]

Kihwal Lee commented on HADOOP-8239:

I think adding a new class is a good idea. Since DFS.getFileChecksum() is expected to return
MD5MD5CRC32FileChecksum in a lot of places, subclassing MD5MD5CRC32FileChecksum for each variant
could work.

We can regard "CRC32" in MD5MD5CRC32FileChecksum as a generic term for any 32-bit CRC algorithm.
At least that is the case in current 2.0/trunk. If we go with this, subclassing MD5MD5CRC32FileChecksum
for each variant makes sense.

The following is what I am thinking:

*In MD5MD5CRC32FileChecksum*

 * The constructor sets crcType to DataChecksum.Type.CRC32
 * getAlgorithmName() will use it to construct the name

private DataChecksum.Type getCrcType() {
  return crcType;
}

public ChecksumOpt getChecksumOpt() {
  return new ChecksumOpt(getCrcType(), bytesPerCrc);
}

*Subclass MD5MD5CRC32GzipFileChecksum*
 The constructor sets crcType to DataChecksum.Type.CRC32
*Subclass MD5MD5CRC32CastagnoliFileChecksum*
 The constructor sets crcType to DataChecksum.Type.CRC32C
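To make the shape of this concrete, here is a self-contained sketch of the subclassing scheme above. DataChecksum.Type and ChecksumOpt are replaced by minimal local stand-ins, and the algorithm-name format is simplified; the real Hadoop classes differ in detail.

```java
// Sketch only: stand-ins for the real Hadoop types, illustrating how each
// subclass sets crcType in its constructor and inherits everything else.
public class ChecksumSketch {

    // Stand-in for DataChecksum.Type.
    enum CrcType { CRC32, CRC32C }

    // Stand-in for Options.ChecksumOpt.
    static class ChecksumOpt {
        final CrcType crcType;
        final int bytesPerChecksum;
        ChecksumOpt(CrcType crcType, int bytesPerChecksum) {
            this.crcType = crcType;
            this.bytesPerChecksum = bytesPerChecksum;
        }
    }

    static class MD5MD5CRC32FileChecksum {
        protected CrcType crcType = CrcType.CRC32; // base default
        protected final int bytesPerCrc;

        MD5MD5CRC32FileChecksum(int bytesPerCrc) {
            this.bytesPerCrc = bytesPerCrc;
        }

        CrcType getCrcType() { return crcType; }

        // The algorithm name is derived from the stored CRC type, so each
        // subclass reports its own variant without overriding this method.
        // (Name format simplified for the example.)
        String getAlgorithmName() {
            return "MD5MD5" + crcType.name() + "FileChecksum";
        }

        ChecksumOpt getChecksumOpt() {
            return new ChecksumOpt(getCrcType(), bytesPerCrc);
        }
    }

    // The gzip polynomial is plain CRC32, so the base default already fits.
    static class MD5MD5CRC32GzipFileChecksum extends MD5MD5CRC32FileChecksum {
        MD5MD5CRC32GzipFileChecksum(int bytesPerCrc) { super(bytesPerCrc); }
    }

    // The Castagnoli variant overrides the CRC type in its constructor.
    static class MD5MD5CRC32CastagnoliFileChecksum extends MD5MD5CRC32FileChecksum {
        MD5MD5CRC32CastagnoliFileChecksum(int bytesPerCrc) {
            super(bytesPerCrc);
            crcType = CrcType.CRC32C;
        }
    }

    public static void main(String[] args) {
        MD5MD5CRC32FileChecksum gzip = new MD5MD5CRC32GzipFileChecksum(512);
        MD5MD5CRC32FileChecksum cast = new MD5MD5CRC32CastagnoliFileChecksum(512);
        System.out.println(gzip.getAlgorithmName()); // MD5MD5CRC32FileChecksum
        System.out.println(cast.getAlgorithmName()); // MD5MD5CRC32CFileChecksum
    }
}
```

Since callers keep the MD5MD5CRC32FileChecksum static type, existing code that expects that class is unaffected, while new code can call getChecksumOpt() to learn the actual CRC variant.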

*Interoperability & compatibility*
- Any existing user/hadoop code that expects MD5MD5CRC32FileChecksum from DFS.getFileChecksum()
will continue to work.
- Any new code that makes use of the new getChecksumOpt() will work as long as DFSClient#getFileChecksum()
creates and returns the right object. This will be done in HDFS-3177; without it, everything
will default to CRC32, which is the current behavior of branch-2/trunk.
- A newer client calling getFileChecksum() against an old cluster over hftp or webhdfs will work
(always CRC32).
- An older client calling getFileChecksum() against a newer cluster: if the remote file on the newer
cluster is in CRC32, both hftp and webhdfs work. If it is CRC32C or anything else, hftp will get
a checksum mismatch. Over webhdfs, the client will receive an algorithm field that won't match anything
the old MD5MD5CRC32FileChecksum can create, and WebHdfsFileSystem will throw an IOException,
"Algorithm not matched:....".
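The old-client failure mode can be illustrated with a small hypothetical check; the constant and method names here are invented for the example and are not the actual WebHdfsFileSystem code.

```java
import java.io.IOException;

// Hypothetical illustration of the mismatch described above: an old client
// that only knows the CRC32-based algorithm name rejects anything else.
public class AlgorithmMismatch {

    // The only algorithm name an old MD5MD5CRC32FileChecksum can produce
    // (format simplified for the example).
    static final String OLD_CLIENT_ALGORITHM = "MD5MD5CRC32FileChecksum";

    static void checkAlgorithm(String algorithmFromServer) throws IOException {
        if (!OLD_CLIENT_ALGORITHM.equals(algorithmFromServer)) {
            throw new IOException("Algorithm not matched: " + algorithmFromServer);
        }
    }

    public static void main(String[] args) {
        try {
            // A newer cluster serving a CRC32C file reports a name the old
            // client has never seen, so the call fails.
            checkAlgorithm("MD5MD5CRC32CFileChecksum");
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```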

I think this is reasonable. What do you think?
> Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used
> --------------------------------------------------------------------------
>                 Key: HADOOP-8239
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8239
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>             Fix For: 2.1.0-alpha
>         Attachments: hadoop-8239-after-hadoop-8240.patch.txt, hadoop-8239-after-hadoop-8240.patch.txt,
hadoop-8239-before-hadoop-8240.patch.txt, hadoop-8239-before-hadoop-8240.patch.txt
> In order to support HADOOP-8060, MD5MD5CRC32FileChecksum needs to be extended to carry
the information on the actual checksum type being used. The interoperability between the extended
version and branch-1 should be guaranteed when Filesystem.getFileChecksum() is called over
hftp, webhdfs or httpfs.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

