hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS
Date Mon, 08 Sep 2008 17:17:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629217#action_12629217 ]

Doug Cutting commented on HADOOP-3981:

> Otherwise the file checksum computed depends on the block size.

It still depends on bytes.per.checksum, which can vary per file, just like block size.  If
two files have different bytes.per.checksum values, then we should not compare their
CRC-derived checksums.  Perhaps we can include bytes.per.checksum in the algorithm name, e.g.,
MD5-of-CRC32-every-512bytes could be an algorithm name.  If we compute these per-block, then
the algorithm name would be MD5-of-CRC32-every-512bytes-with-64Mblocks.
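
As a rough illustration of the naming idea (a sketch only, not the HDFS implementation), the following Java computes an MD5 over per-chunk CRC32 values and carries the chunk size in the algorithm name, so that two checksums are compared only when their names match. The class Md5OfCrc32Sketch and all of its members are hypothetical, and the whole input is held in memory for simplicity.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical sketch: an MD5 over per-chunk CRC32s, tagged with an
// algorithm name that encodes bytes.per.checksum.
final class Md5OfCrc32Sketch {
    final String algorithm; // e.g. "MD5-of-CRC32-every-512bytes"
    final byte[] digest;    // 16-byte MD5 of the concatenated CRC32 values

    private Md5OfCrc32Sketch(String algorithm, byte[] digest) {
        this.algorithm = algorithm;
        this.digest = digest;
    }

    static String algorithmName(int bytesPerChecksum) {
        return "MD5-of-CRC32-every-" + bytesPerChecksum + "bytes";
    }

    static Md5OfCrc32Sketch of(byte[] data, int bytesPerChecksum) {
        if (bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("bytesPerChecksum must be positive");
        }
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        CRC32 crc = new CRC32();
        for (int off = 0; off < data.length; off += bytesPerChecksum) {
            int len = Math.min(bytesPerChecksum, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            int v = (int) crc.getValue();
            // Feed each chunk's CRC32 into MD5 as 4 big-endian bytes.
            md5.update(new byte[] {
                (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v
            });
        }
        return new Md5OfCrc32Sketch(algorithmName(bytesPerChecksum), md5.digest());
    }

    // Checksums built with different chunk sizes must not be compared.
    boolean comparableWith(Md5OfCrc32Sketch other) {
        return algorithm.equals(other.algorithm);
    }

    boolean matches(Md5OfCrc32Sketch other) {
        return comparableWith(other) && Arrays.equals(digest, other.digest);
    }
}
```

With this shape, identical data checksummed at 512 and at 1024 bytes per CRC yields two results that are not comparable, rather than two results that look like a mismatch.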

If we compute checksums on demand from CRCs then it will be relatively slow.  Distcp thus
needs to be sure to only get checksums when lengths match and the alternative is copying the
entire file.  So long as distcp is the primary client of checksums this is probably sufficient
and we should not bother storing checksums.
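
The ordering described above (cheap length check first, slow checksum only when lengths match) could look roughly like the sketch below. CopyDecisionSketch and needsCopy are hypothetical names, not part of distcp, and the suppliers stand in for whatever call would fetch a checksum on demand.

```java
import java.util.Arrays;
import java.util.function.Supplier;

// Hypothetical sketch of a copy decision that avoids on-demand checksums
// unless the file lengths already match.
final class CopyDecisionSketch {
    static boolean needsCopy(long srcLength, long dstLength,
                             Supplier<byte[]> srcChecksum,
                             Supplier<byte[]> dstChecksum) {
        if (srcLength != dstLength) {
            // Lengths differ: the files cannot be equal, so copy without
            // paying for a checksum computation.
            return true;
        }
        // Lengths match: only now is the slow checksum worth computing,
        // because the alternative is copying the entire file.
        return !Arrays.equals(srcChecksum.get(), dstChecksum.get());
    }
}
```

Passing the checksums as suppliers keeps the expensive call lazy, so it is never triggered for files whose lengths already differ.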

Another API to consider might be:
  - String[] getChecksumAlgorithms(Path)
  - Checksum getChecksum(Path)

This way an HDFS filesystem might return ["MD5-of-CRC32-every-512bytes-with-64Mblocks", "MD5-of-CRC32-every-512bytes",
"MD5"], the possible algorithms for a file in preferred order.  Then distcp could call this
for two files (whose lengths match) to see if they have any compatible algorithms.  If possible,
CRCs would be combined on datanodes, but, if block sizes differ, the CRCs could be summed
in the client.  If the CRCs are incompatible, then MD5s could be computed on datanodes.  Is
this overkill?  Probably.
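
A client-side negotiation over the two preference lists might be sketched as follows. ChecksumNegotiationSketch and firstCommon are hypothetical names; the sketch assumes each array is what a call like getChecksumAlgorithms(Path) would return, and it simply picks the first algorithm in the source's preferred order that the destination also offers.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: choose a checksum algorithm both files support.
final class ChecksumNegotiationSketch {
    static Optional<String> firstCommon(String[] srcAlgorithms, String[] dstAlgorithms) {
        List<String> dst = Arrays.asList(dstAlgorithms);
        // Walk the source list in its preferred order and return the first
        // algorithm that the destination also lists.
        for (String algorithm : srcAlgorithms) {
            if (dst.contains(algorithm)) {
                return Optional.of(algorithm);
            }
        }
        // No shared algorithm: the caller has to fall back, e.g. to copying.
        return Optional.empty();
    }
}
```

For example, if the source lists the three names above and the destination has a different block size (say a hypothetical "MD5-of-CRC32-every-512bytes-with-128Mblocks" as its first entry), the per-block names do not match and the negotiation settles on "MD5-of-CRC32-every-512bytes".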

> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading the entire
> input message sequentially in a central location.  HDFS supports large files of multiple
> terabytes.  The overhead of reading the entire file is huge.  A distributed file checksum
> algorithm is needed for HDFS.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
