hadoop-common-dev mailing list archives

From "Raghu Angadi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1134) Block level CRCs in HDFS
Date Mon, 26 Mar 2007 21:55:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12484240

Raghu Angadi commented on HADOOP-1134:

Outline of requirements and decisions made:

I will include more details on some of these points as the implementation or discussion continues.
I think I can start implementation of some of the parts.
Here DFS refers to the Namenode (NN) and Datanodes (DNs), and client refers to the DFS client.
1) Checksums are maintained end to end and stored on datanodes as metadata for each block
file. A 4-byte CRC32 is calculated for each configurable-size chunk of block data. The default
is a 4-byte CRC for each 64KB of block data.
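The per-chunk checksum in point 1 can be sketched as follows, using java.util.zip.CRC32. The helper name and signature are illustrative, not the actual HDFS code:

```java
import java.util.zip.CRC32;

public class ChunkedCrc {
    // Hypothetical helper: one 4-byte CRC32 per fixed-size chunk of block
    // data. The proposal's default chunk size is 64 KB.
    static long[] chunkChecksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] crcs = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int off = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, data.length - off);
            crc.reset();
            crc.update(data, off, len);
            crcs[i] = crc.getValue();  // low 32 bits hold the checksum
        }
        return crcs;
    }

    public static void main(String[] args) {
        byte[] block = new byte[150 * 1024];            // 150 KB of zeros
        long[] crcs = chunkChecksums(block, 64 * 1024); // 64 KB chunks
        System.out.println(crcs.length);                // 3 chunks: 64+64+22 KB
    }
}
```

A 128 MB block would carry 2048 such CRCs (8 KB of metadata), which is the space saving over per-512-byte checksums discussed in future enhancement 4.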
2) The DN keeps the checksums for each block as a separate file (e.g. blk_<id>.crc) that
includes a header at the front of the file.
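A minimal sketch of what such a metadata file could look like on disk; the header fields (a version short and the bytes-per-checksum value) are an assumption for illustration, not the finalized format:

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

public class BlockCrcFile {
    // Hypothetical on-disk layout for blk_<id>.crc: a small header
    // (version + bytesPerChecksum) followed by one 4-byte CRC per chunk.
    static void writeCrcFile(File out, byte[] blockData, int bytesPerChecksum)
            throws IOException {
        try (DataOutputStream dos =
                 new DataOutputStream(new FileOutputStream(out))) {
            dos.writeShort(1);               // format version (assumed)
            dos.writeInt(bytesPerChecksum);  // chunk size covered by each CRC
            CRC32 crc = new CRC32();
            for (int off = 0; off < blockData.length; off += bytesPerChecksum) {
                int len = Math.min(bytesPerChecksum, blockData.length - off);
                crc.reset();
                crc.update(blockData, off, len);
                dos.writeInt((int) crc.getValue()); // 4-byte CRC per chunk
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("blk_1234", ".crc");
        writeCrcFile(f, new byte[150 * 1024], 64 * 1024);
        System.out.println(f.length()); // 6-byte header + 3 * 4-byte CRCs = 18
        f.delete();
    }
}
```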
3) Checksums are calculated by the client while writing and passed on to the DN. The DN verifies
them before writing to disk. The DN also verifies the checksum each time it reads block data
to serve to a client, and the client verifies it as well.
4) The data transfer protocol between the client and datanodes includes inline checksums
transmitted along with the data on the same TCP connection. The client reads from a different
replica when a checksum from a DN fails.
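The client-side rule in point 4 (verify the inline checksum, fall back to the next replica on mismatch) can be sketched as below. The Replica interface and method names are placeholders for the datanode connection, not the real protocol classes:

```java
import java.util.List;
import java.util.zip.CRC32;

public class ReplicaReader {
    // Placeholder for a datanode connection serving one chunk plus its
    // inline CRC32 on the same stream.
    interface Replica {
        byte[] readChunk();
        long readChecksum();
    }

    // Sketch: accept the first replica whose chunk matches its checksum;
    // on mismatch, move on to the next replica (the DN would also be
    // reported to the namenode per point 5).
    static byte[] readVerified(List<Replica> replicas) {
        for (Replica r : replicas) {
            byte[] chunk = r.readChunk();
            CRC32 crc = new CRC32();
            crc.update(chunk, 0, chunk.length);
            if (crc.getValue() == r.readChecksum()) {
                return chunk; // checksum matched, accept the data
            }
        }
        throw new RuntimeException("all replicas failed checksum");
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        long ok = c.getValue();
        Replica bad = new Replica() {
            public byte[] readChunk() { return data.clone(); }
            public long readChecksum() { return 0L; } // deliberately wrong
        };
        Replica good = new Replica() {
            public byte[] readChunk() { return data.clone(); }
            public long readChecksum() { return ok; }
        };
        System.out.println(readVerified(List.of(bad, good)).length); // 3
    }
}
```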
5) When a DN notices a checksum failure, it informs the namenode. The namenode will treat
this as a deleted block in the initial implementation. A later improvement will delay deletion
of the block until a new valid replica is created.
6) The DistributedFileSystem class will not extend ChecksumFileSystem, since checksums will
be integral to DFS. We could have a ChecksumDistributedFileSystem if we need user-visible
checksums.

7) Upgrade: When DFS is upgraded to this new version, the DFS cluster will be in safe mode
until all (or most of) the datanodes upgrade their local files with checksums. This process
is expected to last a couple of hours.
8) Currently each DFS file has an associated checksum stored in a ".crc" file. During the
upgrade, datanodes fetch the relevant parts of the .crc files to verify the checksums for
each block. Since this involves interaction with the namenode, the namenode could be busy or
even a bottleneck during the upgrade.

9) We haven't decided how and when to delete the .crc files. They could also be deleted by
a shell script.

Future enhancements : 
1) Benchmark CPU and I/O overhead of checksums on datanodes. Most of the CPU overhead could
be hidden by overlapping with network and disk I/O. Tests showed that Java CRC32 takes 5-6
microseconds for each 1MB of data. Because of the overlap I don't expect any noticeable increase
in latency from CPU. Disk I/O overhead might contribute more to latency.
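The benchmark in enhancement 1 can be sketched roughly as follows; the iteration count and buffer size are arbitrary choices for illustration, and results will vary by JVM and hardware:

```java
import java.util.zip.CRC32;

public class CrcBench {
    // Rough timing sketch: CRC32 over a fixed 1 MB buffer, averaged over
    // many iterations, to estimate the per-MB CPU cost on a datanode.
    public static void main(String[] args) {
        byte[] buf = new byte[1 << 20];  // 1 MB
        CRC32 crc = new CRC32();
        int iters = 100;
        // warm-up pass so JIT compilation doesn't skew the measurement
        crc.update(buf, 0, buf.length);
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            crc.reset();
            crc.update(buf, 0, buf.length);
        }
        long perMbMicros = (System.nanoTime() - start) / iters / 1000;
        System.out.println("CRC32 per MB: ~" + perMbMicros + " us");
    }
}
```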
2) Based on the benchmark tests, investigate whether an in-memory cache can be used for CRCs.
3) Datanodes should periodically scan and verify checksums for its blocks. 
4) Option to change the CRC-block size. E.g., during the upgrade, datanodes maintain a 4-byte
CRC for every 512 bytes, since the ".crc" files used 512-byte chunks. We might want to convert
them to 4 bytes for every 64KB of data.

5) Namenode should delay deletion of a corrupted block until a new replica is created.

> Block level CRCs in HDFS
> ------------------------
>                 Key: HADOOP-1134
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1134
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Raghu Angadi
>         Assigned To: Raghu Angadi
> Currently CRCs are handled at the FileSystem level and are transparent to core HDFS. See
the recent improvement HADOOP-928 (which can add checksums to a given filesystem) for more
about it. Though this has served us well, there are a few disadvantages:
> 1) This doubles the namespace in HDFS (or other filesystem implementations). In many cases,
it nearly doubles the number of blocks. Taking the namenode out of CRCs would nearly double
namespace performance, both in terms of CPU and memory.
> 2) Since CRCs are transparent to HDFS, it cannot actively detect corrupted blocks. With
block-level CRCs, a Datanode can periodically verify the checksums and report corruptions to
the namenode so that new replicas can be created.
> We propose to have CRCs maintained for all HDFS data in much the same way as in GFS.
I will update the jira with detailed requirements and design. This will provide the same
guarantees as the current implementation and will include an upgrade of current data.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
