hadoop-common-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8239) Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used
Date Mon, 20 Aug 2012 16:19:38 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437980#comment-13437980 ]

Kihwal Lee commented on HADOOP-8239:
------------------------------------

I think adding a new class is a good idea. Since DFS.getFileChecksum() is expected to return
MD5MD5CRC32FileChecksum in a lot of places, subclassing MD5MD5CRC32FileChecksum for each variant
could work.

We can regard "CRC32" in MD5MD5CRC32FileChecksum as a generic term for any 32-bit CRC algorithm;
at least that is the case in current 2.0/trunk. If we go with this, subclassing MD5MD5CRC32FileChecksum
for each variant makes sense.

The following is what I am thinking:

*In MD5MD5CRC32FileChecksum*

The constructor sets crcType to DataChecksum.Type.CRC32

{code}
// Set to DataChecksum.Type.CRC32 in the base class constructor; subclass
// constructors overwrite it with their variant's type.
protected DataChecksum.Type crcType;

/**
 * getAlgorithmName() will use this to construct the algorithm name.
 */
private DataChecksum.Type getCrcType() {
  return crcType;
}

public ChecksumOpt getChecksumOpt() {
  return new ChecksumOpt(getCrcType(), bytesPerCrc);
}
{code}
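
As a usage illustration (hypothetical caller code; ChecksumOpt is the option type from HADOOP-8240), downstream code could recover the concrete checksum parameters like this:

{code}
// Hypothetical caller: recover the actual checksum parameters of a file.
FileChecksum fc = fs.getFileChecksum(srcPath);
if (fc instanceof MD5MD5CRC32FileChecksum) {
  ChecksumOpt opt = ((MD5MD5CRC32FileChecksum) fc).getChecksumOpt();
  // opt carries the concrete CRC type (CRC32 or CRC32C) plus bytesPerCrc,
  // e.g. to create a destination file with matching checksum settings.
}
{code}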

*Subclass MD5MD5CRC32GzipFileChecksum*
The constructor sets crcType to DataChecksum.Type.CRC32

*Subclass MD5MD5CRC32CastagnoliFileChecksum*
The constructor sets crcType to DataChecksum.Type.CRC32C
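
A minimal sketch of the two subclasses under this proposal (assuming crcType is a protected field of the base class and reusing its existing (bytesPerCRC, crcPerBlock, md5) constructor):

{code}
public class MD5MD5CRC32GzipFileChecksum extends MD5MD5CRC32FileChecksum {
  public MD5MD5CRC32GzipFileChecksum(int bytesPerCRC, long crcPerBlock, MD5Hash md5) {
    super(bytesPerCRC, crcPerBlock, md5);
    crcType = DataChecksum.Type.CRC32;   // same as the base class default
  }
}

public class MD5MD5CRC32CastagnoliFileChecksum extends MD5MD5CRC32FileChecksum {
  public MD5MD5CRC32CastagnoliFileChecksum(int bytesPerCRC, long crcPerBlock, MD5Hash md5) {
    super(bytesPerCRC, crcPerBlock, md5);
    crcType = DataChecksum.Type.CRC32C;  // Castagnoli polynomial
  }
}
{code}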

*Interoperability & compatibility*
- Any existing user/Hadoop code that expects MD5MD5CRC32FileChecksum from DFS.getFileChecksum()
will continue to work.
- Any new code that makes use of the new getChecksumOpt() will work as long as DFSClient#getFileChecksum()
creates and returns the right object (see the sketch after this list). This will be done in HDFS-3177;
without it, everything will default to CRC32, which is the current behavior of branch-2/trunk.
- A newer client calling getFileChecksum() against an old cluster over hftp or webhdfs will work
(always CRC32).
- An older client calling getFileChecksum() against a newer cluster: if the remote file is in CRC32,
both hftp and webhdfs work. If it is CRC32C or anything else, hftp will see a checksum mismatch;
webhdfs will return an algorithm field that won't match anything the old MD5MD5CRC32FileChecksum
can create, and WebHdfsFileSystem will throw an IOException, "Algorithm not matched:....".

I think this is reasonable. What do you think?
                
> Extend MD5MD5CRC32FileChecksum to show the actual checksum type being used
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-8239
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8239
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>             Fix For: 2.1.0-alpha
>
>         Attachments: hadoop-8239-after-hadoop-8240.patch.txt, hadoop-8239-after-hadoop-8240.patch.txt, hadoop-8239-before-hadoop-8240.patch.txt, hadoop-8239-before-hadoop-8240.patch.txt
>
>
> In order to support HADOOP-8060, MD5MD5CRC32FileChecksum needs to be extended to carry the information on the actual checksum type being used. The interoperability between the extended version and branch-1 should be guaranteed when FileSystem.getFileChecksum() is called over hftp, webhdfs or httpfs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
