hadoop-hdfs-issues mailing list archives

From "M. C. Srivas (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2699) Store data and checksums together in block file
Date Mon, 19 Dec 2011 01:48:31 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171997#comment-13171997 ]

M. C. Srivas commented on HDFS-2699:
------------------------------------

@dhruba:

>> a block size of 4096 is too large for the CRC

> the hbase block size is 16K. The hdfs checksum size is 4K. The hdfs block size is 256 MB.
> Which one are you referring to here? Can you please explain the read-modify-write cycle?
> HDFS does mostly large sequential writes (no overwrites).

The CRC block size (that is, the contiguous region of the file that a single CRC covers). Modifying
any portion of that region requires reading the entire region back in, recomputing the CRC over the
whole region, and writing the entire region out again.
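
To make the read-modify-write cost concrete, here is a minimal sketch of what updating one inline CRC chunk implies, assuming a 4K CRC region and CRC32; the class and constants are purely illustrative, not the actual HDFS code:

{code}
import java.util.zip.CRC32;

// Illustrative only: rewriting a single inline CRC chunk.
class CrcChunkRewrite {
    static final int CHUNK_SIZE = 4 * 1024;   // assumed CRC region size

    // Patch a few bytes inside one chunk. Even a 1-byte change forces the
    // whole region to be read back, re-CRC'd, and rewritten.
    static long patchChunk(byte[] chunk, int offsetInChunk, byte[] newBytes) {
        System.arraycopy(newBytes, 0, chunk, offsetInChunk, newBytes.length);
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);    // recompute over the ENTIRE region
        return crc.getValue();                 // caller rewrites data + CRC together
    }
}
{code}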


Note that it also introduces a new failure mode ... data that was written safely a long time ago
can now be deemed "corrupt" because its CRC is no longer valid after a minor modification during
an append. The failure scenario is as follows:

1. A thread writes to a file and closes it. Let's say the file length is 9K. There are 3 CRCs
embedded inline -- one for 0-4K, one for 4K-8K, and one for 8K-9K. Call the last one CRC3.

2. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed
for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is
a crash, and the entire 3K region is not written out cleanly (CRC3-new and some of the data are
written out before the crash -- all 3 copies crash and recover).

3. A subsequent read of the region 8K-9K now fails with a CRC error ... even though that write
was stable and reads of it used to succeed before.

If this file were the HBase WAL, wouldn't this result in data loss?
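
For concreteness, the chunk arithmetic behind the scenario above, again assuming 4K CRC regions (the numbers and the class are illustrative only):

{code}
// Illustrative arithmetic for the append scenario above (4K CRC regions assumed).
class InlineCrcLayout {
    static final long CHUNK = 4 * 1024;

    public static void main(String[] args) {
        long oldLen = 9 * 1024;    // file closed at 9K: regions [0,4K) [4K,8K) [8K,9K)
        long newLen = 11 * 1024;   // append extends the file to 11K

        long lastChunkStart = (oldLen / CHUNK) * CHUNK;   // = 8K
        System.out.println("CRC3 covered    [" + lastChunkStart + ", " + oldLen + ")");
        // The append rewrites everything from 8K onward, so the previously
        // stable bytes in [8K, 9K) are now covered only by CRC3-new.
        System.out.println("CRC3-new covers [" + lastChunkStart + ", " + newLen + ")");
        // A crash mid-rewrite can leave CRC3-new inconsistent with the old 8K-9K data.
    }
}
{code}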


                
> Store data and checksums together in block file
> -----------------------------------------------
>
>                 Key: HDFS-2699
>                 URL: https://issues.apache.org/jira/browse/HDFS-2699
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> The current implementation of HDFS stores the data in one block file and the metadata (checksum)
> in another block file. This means that every read from HDFS actually consumes two disk iops,
> one to the data file and one to the checksum file. This is a major problem for scaling HBase,
> because HBase is usually bottlenecked on the number of random disk iops that the storage hardware
> offers.
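
As an illustration of why a random read currently touches two files, a sketch of the offset mapping between a block file and its separate .meta file; the 512-byte chunk, 4-byte CRC and 7-byte header values are assumed defaults here, not read from any configuration:

{code}
// Illustrative only: where the checksum for a given data offset lives today.
class MetaFileOffsets {
    static final int BYTES_PER_CHECKSUM = 512;   // assumed io.bytes.per.checksum
    static final int CHECKSUM_SIZE = 4;          // CRC32 is 4 bytes
    static final int META_HEADER_SIZE = 7;       // assumed .meta header size

    static long checksumOffset(long dataOffset) {
        long chunkIndex = dataOffset / BYTES_PER_CHECKSUM;
        return META_HEADER_SIZE + chunkIndex * CHECKSUM_SIZE;
    }

    public static void main(String[] args) {
        // Reading data at offset 1MB of blk_123 seeks into blk_123 AND into
        // blk_123.meta at the offset below -- two iops for one application read.
        System.out.println(checksumOffset(1 << 20));
    }
}
{code}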

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
