hadoop-common-dev mailing list archives

From Raghu Angadi <rang...@yahoo-inc.com>
Subject Re: Many Checksum Errors
Date Wed, 16 May 2007 18:39:04 GMT
Doug Cutting wrote:
> [ Moving discussion to hadoop-dev.  -drc ]
> 
> Raghu Angadi wrote:
>> This is good validation of how important ECC memory is. Currently the 
>> HDFS client deletes a block when it notices a checksum error. After 
>> moving to block-level CRCs soon, we should make the Datanode re-validate 
>> the block before deciding to delete it.
> 
> It also emphasizes how important end-to-end checksums are.  Data should 
> also be checksummed as soon as possible after it is generated, before it 
> has a chance to be corrupted.
> 
> Ideally, the initial buffer that stores the data should be small, and 
> data should be checksummed as this initial buffer is flushed.

In my implementation of block-level CRCs (which does not affect 
ChecksumFileSystem in HADOOP-928), we don't buffer checksum data at all. 
As soon as io.bytes.per.checksum bytes are written, the checksum is 
written directly to the backup stream. I have removed stream buffering 
in multiple places in DFSClient, but this is still affected by the 
buffering issue you mention below.
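
To make the idea concrete, here is a rough sketch (illustrative only -- 
ChunkedChecksumWriter is a made-up name, not an actual DFSClient class, 
and the real block-level CRC code does not look exactly like this): each 
io.bytes.per.checksum-byte chunk is checksummed and the CRC written out 
as soon as the chunk completes, with no checksum buffer in between.

  import java.io.IOException;
  import java.io.OutputStream;
  import java.util.zip.CRC32;

  // Sketch: checksum every io.bytes.per.checksum bytes and write the CRC
  // to the checksum stream as soon as the chunk completes.
  class ChunkedChecksumWriter {
    private final OutputStream dataOut;      // block data
    private final OutputStream checksumOut;  // per-chunk CRCs
    private final int bytesPerChecksum;      // io.bytes.per.checksum
    private final CRC32 crc = new CRC32();
    private int inChunk = 0;                 // bytes in the current chunk

    ChunkedChecksumWriter(OutputStream dataOut, OutputStream checksumOut,
                          int bytesPerChecksum) {
      this.dataOut = dataOut;
      this.checksumOut = checksumOut;
      this.bytesPerChecksum = bytesPerChecksum;
    }

    void write(byte[] b, int off, int len) throws IOException {
      while (len > 0) {
        int n = Math.min(len, bytesPerChecksum - inChunk);
        dataOut.write(b, off, n);
        crc.update(b, off, n);
        inChunk += n;
        off += n;
        len -= n;
        if (inChunk == bytesPerChecksum) {
          flushChecksum();
        }
      }
    }

    void close() throws IOException {
      if (inChunk > 0) {
        flushChecksum();                     // CRC for the final partial chunk
      }
      dataOut.close();
      checksumOut.close();
    }

    private void flushChecksum() throws IOException {
      long v = crc.getValue();
      checksumOut.write(new byte[] {         // 4-byte big-endian CRC32
          (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v });
      crc.reset();
      inChunk = 0;
    }
  }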

> In the 
> current implementation, the small checksum buffer is the second buffer 
> and the initial buffer is the larger io.buffer.size buffer.  To provide 
> maximum protection against memory errors, this situation should be 
> reversed.
> 
> This is discussed in https://issues.apache.org/jira/browse/HADOOP-928. 
> Perhaps a new issue should be filed to reverse the order of these 
> buffers, so that data is checksummed before entering the larger, 
> longer-lived buffer?

This reversal still does not help block-level CRCs. We could remove 
buffering altogether at the FileSystem level and let the FS 
implementations decide how to buffer.
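
To illustrate the two orderings (again just a sketch, reusing the 
hypothetical ChunkedChecksumWriter above; the 64K buffer, the 512-byte 
chunk size and the file names only stand in for io.buffer.size, 
io.bytes.per.checksum and real block files):

  import java.io.BufferedOutputStream;
  import java.io.FileOutputStream;
  import java.io.OutputStream;

  class BufferOrderSketch {
    static void write(byte[] data) throws Exception {
      OutputStream raw  = new FileOutputStream("/tmp/blk_data");  // placeholder
      OutputStream meta = new FileOutputStream("/tmp/blk_meta");  // placeholder

      // Current order:  app -> big io.buffer.size buffer -> checksum -> disk
      // (a memory error while data sits in the big buffer goes undetected).
      //
      // Reversed order: app -> checksum -> big io.buffer.size buffer -> disk
      // (each small chunk is checksummed before it enters the long-lived
      // buffer), which is what the construction below does:
      OutputStream bufferedData = new BufferedOutputStream(raw, 64 * 1024);
      ChunkedChecksumWriter out =
          new ChunkedChecksumWriter(bufferedData, meta, 512);

      out.write(data, 0, data.length);   // checksummed in 512-byte chunks
      out.close();                       // flushes buffer, closes both files
    }
  }

If FileSystem stopped wrapping streams in a buffer itself, each 
implementation could pick whichever of these orderings (or no buffering 
at all) suits it.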

Raghu.

> Doug

