hadoop-common-dev mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS
Date Sat, 06 Sep 2008 01:10:46 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628814#action_12628814 ]

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:

bq. Why not just use the MD5 or SHA1 of the CRCs?
MD5 requires sequential access to the data.  One easy implementation of MD5-over-CRCs is for
the client to read all the CRCs from the datanodes and then compute the MD5 over them.  However,
this requires reading all the first-level CRCs, which is 800MB for a 100GB file.  Is that too
much network traffic?
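The client-side scheme above can be sketched as follows.  This is only an illustration: the class name, the in-memory byte[] standing in for block data read from datanodes, and the helper method are assumptions, not HDFS code; the 512-byte chunk size matches HDFS's default bytes-per-checksum.

```java
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class Md5OverCrcs {
    // Compute a CRC32 for each fixed-size chunk (HDFS keeps one CRC per
    // 512-byte chunk), then an MD5 over the concatenated CRC bytes.  In the
    // real scheme the CRCs would be fetched from the datanodes over the
    // network; here they are computed locally from a stand-in byte array.
    static byte[] md5OverCrcs(byte[] data, int chunkSize) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            CRC32 crc = new CRC32();
            crc.update(data, off, len);
            long v = crc.getValue();
            // feed the 4 CRC bytes into the running MD5
            md5.update(new byte[] {
                (byte) (v >>> 24), (byte) (v >>> 16),
                (byte) (v >>> 8),  (byte) v });
        }
        return md5.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[2048];          // stand-in for file content
        byte[] sum = md5OverCrcs(data, 512);   // 4 chunks -> 16 CRC bytes in
        System.out.println("checksum length = " + sum.length + " bytes");
    }
}
```

The result is a fixed 16-byte digest regardless of file size; the cost is shipping all the first-level CRCs to one place.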

Raghu has a very good idea for another implementation, which computes the MD5 across datanodes
as follows: the client asks Datanode 1 (which has the first block) to start the MD5 computation.
Datanode 1 returns the intermediate state of the MD5 computation to the client, and the client
sends that intermediate state to Datanode 2 (which has the second block).  Datanode 2 then
continues the MD5 computation and returns its intermediate state to the client, and so on.
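The scheme relies on MD5 being resumable: feeding a digest block 1 and then block 2 gives the same result as feeding the whole stream at once, so a datanode that could import another's intermediate state would continue the computation seamlessly.  A small sketch of that equivalence (the class name and the block contents are arbitrary; note the two datanodes are simulated with a single digest object, since, as noted below, the intermediate state cannot easily be extracted):

```java
import java.security.MessageDigest;

public class ChainedMd5 {
    // Hash two "blocks" with one digest object, the way a datanode that
    // inherited the intermediate state would continue the computation.
    static byte[] chained(byte[] block1, byte[] block2) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(block1);   // Datanode 1's contribution
        md.update(block2);   // Datanode 2 continues from the same state
        return md.digest();
    }

    // Hash the concatenated stream in one shot, for comparison.
    static byte[] oneShot(byte[] block1, byte[] block2) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] all = new byte[block1.length + block2.length];
        System.arraycopy(block1, 0, all, 0, block1.length);
        System.arraycopy(block2, 0, all, block1.length, block2.length);
        md.update(all);
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] b1 = "first block ".getBytes("UTF-8");
        byte[] b2 = "second block".getBytes("UTF-8");
        boolean same = MessageDigest.isEqual(chained(b1, b2), oneShot(b1, b2));
        System.out.println("chained == one-shot: " + same);
    }
}
```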

Note that although this is a distributed algorithm, it is not a parallel one.  Another
problem for MD5 in this implementation is that there is no easy way to get the intermediate
state of an MD5 computation in Java 1.6.

bq. It is more appealing to have a small, fixed size checksum.
This is probably good.  I will think about this.

> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading the entire
> input message sequentially in a central location.  HDFS supports large files of multiple
> terabytes.  The overhead of reading such a file in its entirety is huge.  A distributed
> file checksum algorithm is needed for HDFS.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
