hadoop-common-dev mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS
Date Sat, 06 Sep 2008 01:18:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628817#action_12628817 ]

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:

bq. When should we compute checksums? Are they computed on demand, when someone calls FileSystem#getFileChecksum()?
Or are they pre-computed and stored? If they're not pre-computed then we certainly ought to
compute them from the CRC's. Even if they are to be pre-computed, then we might still use
the CRCs, to reduce FileSystem upgrade time.

It is better to compute the file checksum on demand, so that the Datanode storage layout remains
unchanged and we won't have to do a distributed upgrade.

bq. My hunch is that we should compute them on demand from CRC data. We extend ClientDatanodeProtocol
to add a getChecksum() operation that returns the checksum for a block without transmitting
the CRCs to the client, and the client combines block checksums to get a whole-file checksum.
This is rather expensive, but still a lot faster than checksumming the entire file on demand.

My idea is similar to this, except that we should not compute block checksums.  Otherwise, the
computed file checksum would depend on the block size.  That is why I propose computing the
second-level CRCs over the first-level CRCs.  This idea is borrowed from hash trees (aka
Merkle trees), which are used by ZFS.
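A minimal sketch of the point above.  All names here are illustrative, the 512-byte chunk size stands in for HDFS's per-checksum chunk setting, and MD5 stands in for whatever second-level digest is ultimately chosen.  A digest over per-block checksums changes with the block size, while a digest over the fixed-size first-level chunk CRCs does not:

```python
import hashlib
import zlib

BYTES_PER_CRC = 512  # illustrative stand-in for the per-checksum chunk size


def per_block_digest(data, block_size):
    # Digest over per-block checksums: the result changes with block_size.
    h = hashlib.md5()
    for i in range(0, len(data), block_size):
        h.update(hashlib.md5(data[i:i + block_size]).digest())
    return h.hexdigest()


def crc_tree_digest(data, block_size, chunk=BYTES_PER_CRC):
    # Second-level digest over the first-level chunk CRCs, gathered block by
    # block.  As long as block_size is a multiple of the chunk size, the chunk
    # boundaries are the same for any block size, so the result is identical.
    h = hashlib.md5()
    for b in range(0, len(data), block_size):
        block = data[b:b + block_size]
        for i in range(0, len(block), chunk):
            h.update(zlib.crc32(block[i:i + chunk]).to_bytes(4, "big"))
    return h.hexdigest()


data = b"some file content " * 100  # 1800 bytes

# Per-block digests disagree across block sizes...
assert per_block_digest(data, 512) != per_block_digest(data, 1024)
# ...but the digest over chunk CRCs does not depend on the block size.
assert crc_tree_digest(data, 512) == crc_tree_digest(data, 1024)
```

In a distributed setting, each Datanode would only need to read its stored first-level CRCs for a block, never the block data itself, and the client would combine them in file order.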

> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading the entire
> input message sequentially in a central location.  HDFS supports large files of multiple
> terabytes.  The overhead of reading an entire file is huge, so a distributed file checksum
> algorithm is needed for HDFS.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
