hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS
Date Fri, 05 Sep 2008 20:54:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628743#action_12628743 ]

Doug Cutting commented on HADOOP-3981:

> We use these 6.25MB second level CRCs as the checksum of the entire file.

Why not just use the MD5 or SHA1 of the CRCs?
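The idea can be sketched as follows. This is a minimal illustration, not HDFS code: the 512-byte chunk size, the class name, and the method name are all assumptions made here for the sketch. The point is that the digest is fed the small per-chunk CRCs rather than the raw data, so a checksum can be derived from CRC metadata that already exists.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

public class CrcDigest {
    // Hypothetical chunk size for the sketch; HDFS checksums data in
    // 512-byte chunks by default.
    static final int CHUNK = 512;

    static byte[] md5OfCrcs(byte[] data) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (int off = 0; off < data.length; off += CHUNK) {
                int len = Math.min(CHUNK, data.length - off);
                CRC32 crc = new CRC32();
                crc.update(data, off, len);
                long v = crc.getValue();
                // Feed the 4-byte CRC into the digest instead of the raw data.
                md5.update(new byte[] {
                    (byte) (v >>> 24), (byte) (v >>> 16),
                    (byte) (v >>> 8), (byte) v
                });
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[4096];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        // A standard 16-byte MD5 digest, computed from 8 chunk CRCs.
        System.out.println(CrcDigest.md5OfCrcs(data).length);
    }
}
```

Because the CRCs are tiny compared to the data, this makes the digest cost proportional to the CRC metadata, not the file size.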

When should we compute checksums?  Are they computed on demand, when someone calls FileSystem#getFileChecksum(), or are they pre-computed and stored?  If they're not pre-computed, then we certainly ought to compute them from the CRCs.  Even if they are to be pre-computed, we might still use the CRCs, to reduce FileSystem upgrade time.

If checksums were pre-computed, where would they be stored?  We could store them in the NameNode,
with file metadata, or we could store per-block checksums on datanodes.

My hunch is that we should compute them on demand from CRC data.  We extend ClientDatanodeProtocol to add a getChecksum() operation that returns the checksum for a block without transmitting the CRCs to the client, and the client combines block checksums to get a whole-file checksum.  This is rather expensive, but still a lot faster than checksumming the entire file on demand.  DistCp would be substantially faster if it only used checksums when file lengths match, so we should probably make that optimization.
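A rough sketch of the client-side combination step. The class and method names here are invented for illustration; a real ClientDatanodeProtocol extension would define its own signatures, and each per-block checksum would arrive over the wire from a datanode rather than from a list. Only the small per-block digests cross the network.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class WholeFileChecksum {
    // Combine per-block checksums (as a hypothetical getChecksum() call on
    // each block's datanode would return them) into a whole-file checksum
    // by digesting them in block order.
    static byte[] combine(List<byte[]> blockChecksums) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (byte[] blockSum : blockChecksums) {
                md5.update(blockSum);
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Two dummy 16-byte block checksums standing in for datanode replies.
        byte[] fileSum = combine(List.of(new byte[16], new byte[16]));
        System.out.println(fileSum.length);
    }
}
```

Note that block order matters: the blocks' checksums must be digested in file order, so two files with the same blocks in a different order get different checksums.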

Longer-term we could think about a checksum API that permits a sequence of checksums to be
returned per file, so that, e.g., if a source file has been appended to, we could truncate
the destination and append the new data, incrementally updating it.  But until HDFS supports
truncation this is moot.

> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading the entire
> input message sequentially in a central location.  HDFS supports large files of multiple
> terabytes.  The overhead of reading such a file in its entirety is huge.  A distributed
> file checksum algorithm is needed for HDFS.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
