hadoop-common-dev mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS
Date Wed, 03 Sep 2008 21:45:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628156#action_12628156
] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

Currently, the Datanode stores a CRC-32 for every 512-byte chunk.  Let's call these CRCs the
first-level CRCs.  Since each CRC-32 is 4 bytes, the total size of the first-level CRCs is about
1/128 of the data size.
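(For illustration only, not the actual Datanode code: the per-chunk CRC could be computed with
java.util.zip.CRC32 roughly as below; the class and method names are made up for this sketch.)

import java.util.zip.CRC32;

// Rough sketch only -- not the Datanode implementation.
// One CRC-32 (4 bytes) per 512-byte chunk => overhead about 4/512 = 1/128 of the data.
public class FirstLevelCrc {
  static final int CHUNK_SIZE = 512;

  // Returns one CRC-32 per 512-byte chunk of the block data.
  static int[] crcPerChunk(byte[] blockData) {
    int numChunks = (blockData.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
    int[] crcs = new int[numChunks];
    CRC32 crc = new CRC32();
    for (int i = 0; i < numChunks; i++) {
      int off = i * CHUNK_SIZE;
      int len = Math.min(CHUNK_SIZE, blockData.length - off);
      crc.reset();
      crc.update(blockData, off, len);
      crcs[i] = (int) crc.getValue();
    }
    return crcs;
  }
}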

How about computing a second level of checksums over the first-level CRCs?  That is, for every
512 bytes of first-level CRCs, we compute a CRC-32.  The second-level CRCs are then about
1/16384 of the data size.  We could use these second-level CRCs as the checksum of the file.

For example, for a 100GB file, the first-level CRCs take about 800MB and the second-level CRCs
only about 6.25MB.  We would use these 6.25MB of second-level CRCs as the checksum of the
entire file.
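(Again only an illustrative sketch with made-up names, not an HDFS API: the second level would
feed every 512 bytes of serialized first-level CRCs into one CRC-32, which is where the
1/128 * 1/128 = 1/16384 ratio comes from.)

import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative sketch of the proposed second level -- not HDFS code.
public class SecondLevelCrc {
  static final int GROUP_SIZE = 512;  // bytes of first-level CRCs per second-level CRC

  // Serialize the first-level CRCs (4 bytes each) and compute one CRC-32 for every
  // 512 bytes of them: e.g. 100GB data -> ~800MB first level -> ~6.25MB second level.
  static int[] secondLevel(int[] firstLevelCrcs) {
    ByteBuffer buf = ByteBuffer.allocate(firstLevelCrcs.length * 4);
    for (int c : firstLevelCrcs) {
      buf.putInt(c);
    }
    byte[] bytes = buf.array();

    int numGroups = (bytes.length + GROUP_SIZE - 1) / GROUP_SIZE;
    int[] result = new int[numGroups];
    CRC32 crc = new CRC32();
    for (int i = 0; i < numGroups; i++) {
      int off = i * GROUP_SIZE;
      int len = Math.min(GROUP_SIZE, bytes.length - off);
      crc.reset();
      crc.update(bytes, off, len);
      result[i] = (int) crc.getValue();
    }
    return result;
  }
}

Each Datanode could compute such second-level CRCs locally over the first-level CRCs it already
stores, so only the small second-level CRCs would need to be collected to form the file checksum.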


> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading the entire
> input message sequentially in a central location.  HDFS supports large files of multiple
> terabytes, so the overhead of reading an entire file in one place is huge.  A distributed
> file checksum algorithm is needed for HDFS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

