hadoop-hdfs-dev mailing list archives

From "Colin Marc (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-7554) Checksumming is implementation specific
Date Fri, 19 Dec 2014 17:00:17 GMT
Colin Marc created HDFS-7554:

             Summary: Checksumming is implementation specific
                 Key: HDFS-7554
                 URL: https://issues.apache.org/jira/browse/HDFS-7554
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: dfsclient
    Affects Versions: 2.5.0
            Reporter: Colin Marc
            Priority: Minor

The code that calculates checksums of files in DFSClient is implementation specific. That
is to say, the checksums are consistent as long as you use the same code, but the algorithm
isn't particularly stable or portable.

In DFSClient.java, when each individual checksum is received for a block, those checksums
are written out to a DataOutputBuffer: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2173

Then the checksum is calculated by digesting all the data from that buffer: https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L2231

However, that buffer's backing array is (reasonably) automatically padded with zeroes out to
the next power of two, and those zeroes are included in the checksum.
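A minimal sketch of the effect (hypothetical class and method names, not the actual DFSClient code): if the digest is computed over a backing array that was grown to the next power of two, the trailing zero bytes change the result relative to digesting only the bytes actually written.

```java
import java.security.MessageDigest;
import java.util.Arrays;

// Illustrative only: mimics a DataOutputBuffer-like buffer whose backing
// array doubles in capacity, leaving trailing zeroes after the real data.
public class PaddedDigestDemo {
    static byte[] md5(byte[] data, int len) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(data, 0, len);
        return md.digest();
    }

    public static void main(String[] args) throws Exception {
        byte[] checksums = new byte[20];           // e.g. 20 bytes of per-block checksums
        Arrays.fill(checksums, (byte) 0x5a);

        // Backing array grown to the next power of two (32), as a doubling
        // buffer would do; the last 12 bytes are zero padding.
        byte[] backing = Arrays.copyOf(checksums, 32);

        byte[] exact  = md5(checksums, checksums.length); // digest of the real data only
        byte[] padded = md5(backing, backing.length);     // digest including the zero padding

        System.out.println(Arrays.equals(exact, padded)); // false: padding changes the digest
    }
}
```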

This effectively means that the checksum algorithm depends on the behavior of DataOutputBuffer,
which is a bit surprising, and could change in the future. It would be much more stable, not
to mention memory efficient, if the final hash were simply updated with each block checksum
as it arrives, rather than buffering them all and then digesting the buffer.
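The suggested fix could look something like this (a hypothetical helper, not the DFSClient API): feed each block checksum straight into a MessageDigest, so there is no intermediate buffer and therefore no padding to worry about.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the incremental approach: the digest is updated per block
// checksum, using O(1) extra memory and no intermediate buffer.
public class IncrementalFileChecksum {
    private final MessageDigest md;

    public IncrementalFileChecksum() throws NoSuchAlgorithmException {
        md = MessageDigest.getInstance("MD5");
    }

    // Called once per block as its checksum arrives from the datanode.
    public void addBlockChecksum(byte[] blockChecksum) {
        md.update(blockChecksum);
    }

    // Final file checksum: identical to digesting the concatenation of all
    // block checksums, with no dependence on any buffer's growth strategy.
    public byte[] fileChecksum() {
        return md.digest();
    }
}
```

Because MessageDigest.update() over successive chunks is equivalent to digesting their concatenation, this produces the same hash as the unpadded buffer-based computation would.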

This message was sent by Atlassian JIRA
