hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4605) Implement block-size independent file checksum
Date Fri, 15 Mar 2013 16:54:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603509#comment-13603509 ]

Kihwal Lee commented on HDFS-4605:

From MAPREDUCE-5065,

bq. Most hashing is incremental, so if DFSClient feeds the last state of hash into the next
datanode and let it continue updating it, the result will be independent of block size. The
current way of doing file checksum allows calculating individual block checksums in parallel,
but we are not taking advantage of it in DFSClient anyway. So I don't think there will be
any significant changes in performance or overhead.

I think this will work as long as:
* there are no partial blocks in the middle, and
* the block size is a multiple of the CRC chunk/block size.
As far as I know, both are enforced in HDFS.
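To illustrate the idea (this is a hypothetical sketch, not DFSClient code): if each datanode resumes the digest from the state the previous one left off at, the result depends only on the byte stream, not on where the block boundaries fall. The class and method names below are made up for the example.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Sketch of a chained (incremental) file digest. Each "block" continues
// updating the same MessageDigest, analogous to feeding the last hash state
// into the next datanode instead of hashing every block independently.
public class IncrementalChecksumSketch {

    // Hash the data as a sequence of blocks of the given size, carrying the
    // digest state forward across block boundaries.
    static byte[] chainedDigest(byte[] data, int blockSize) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (int off = 0; off < data.length; off += blockSize) {
                md.update(data, off, Math.min(blockSize, data.length - off));
            }
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always present in the JDK
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        new java.util.Random(42).nextBytes(data);

        // Same byte stream, two different block sizes: digests match.
        byte[] a = chainedDigest(data, 64 * 1024);
        byte[] b = chainedDigest(data, 256 * 1024);
        System.out.println("block-size independent: " + Arrays.equals(a, b));
    }
}
```

By contrast, the current MD5-of-per-block-checksums scheme produces different values for the same file content under different block sizes, which is exactly the dependency this issue proposes to remove.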

Assuming this can be done, what will be the best way to add this feature? 
> Implement block-size independent file checksum
> ----------------------------------------------
>                 Key: HDFS-4605
>                 URL: https://issues.apache.org/jira/browse/HDFS-4605
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 3.0.0
>            Reporter: Kihwal Lee
> The value of the current getFileChecksum() is block-size dependent. Since FileChecksum is
> mainly intended for comparing the content of files, removing this dependency will make
> FileChecksum in HDFS relevant in more use cases.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
