hadoop-hdfs-issues mailing list archives

From "Scott Carey (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2699) Store data and checksums together in block file
Date Fri, 06 Jan 2012 22:10:40 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181643#comment-13181643 ]
Scott Carey commented on HDFS-2699:

bq. If you want to eventually support random-IO, then a block size of 4096 is too large for
the CRC, as it will cause a read-modify-write cycle on the entire 4K. 512-bytes reduces this

With CRC hardware acceleration now common, this is not a big overhead.  Without hardware acceleration
it is ~800MB/sec for 4096 byte chunks, or ~200,000 chunks per second, or 25% of one CPU
at 200MB/sec writes.  With hardware acceleration this cost drops by a factor of 4 to 8.
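The arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model, not a benchmark: the 800MB/sec software CRC throughput and 200MB/sec write rate are the figures assumed in this comment.

```java
// Back-of-the-envelope CRC overhead model for 4096-byte chunks.
// The throughput numbers are assumptions from the discussion, not measurements.
public class CrcCost {
    // Chunks checksummed per second at a given CRC throughput
    static double chunksPerSecond(double crcMBps, int chunkBytes) {
        return crcMBps * 1_000_000 / chunkBytes;
    }

    // Fraction of one core spent computing CRCs at a given write rate
    static double cpuFraction(double writeMBps, double crcMBps) {
        return writeMBps / crcMBps;
    }

    public static void main(String[] args) {
        // ~195k chunks/sec at 800 MB/s software CRC over 4096-byte chunks
        System.out.printf("%.0f chunks/s%n", chunksPerSecond(800, 4096));
        // 200 MB/s of writes against 800 MB/s of CRC throughput = 25% of one core
        System.out.printf("%.0f%% of one core%n", cpuFraction(200, 800) * 100);
    }
}
```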

This is beside the point, though: a paranoid user could configure smaller CRC chunks and test that.
 I'm suggesting that 4096 is a much saner default.

bq. Secondly, the disk manufacturers guarantee only a 512-byte atomicity on disk. Linux doing
a 4K block write guarantees almost nothing wrt atomicity of that 4K write to disk. On a crash,
unless you are running some sort of RAID or data-journal, there is a likelihood of the 4K
block that's in-flight getting corrupted.

Actually, disk manufacturers are all using 4096 byte atomicity these days (starting with 500GB
platters for most manufacturers) **.  HDFS should not target protecting power_of_two_bytes of
data with a checksum, but rather (power_of_two_bytes - checksum_size) bytes of data, so that the hardware
atomicity (and the OS page cache) lines up exactly with the HDFS checksum chunk + inlined CRC.
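To make the alignment concrete, here is a minimal sketch of the offset math, assuming a 4-byte CRC32 inlined at the end of each 4096-byte page (so 4092 bytes of data per page); the class and method names are illustrative, not anything in HDFS.

```java
// Mapping between logical data offsets and page-aligned on-disk chunks,
// assuming a 4-byte CRC inlined at the end of every 4096-byte page.
public class ChunkLayout {
    static final int PAGE = 4096;
    static final int CRC_SIZE = 4;
    static final int DATA_PER_PAGE = PAGE - CRC_SIZE;   // 4092 bytes of data

    // Which on-disk page holds logical data offset 'off'?
    static long pageIndex(long off) {
        return off / DATA_PER_PAGE;
    }

    // Byte position of that data within the block file
    static long fileOffset(long off) {
        return pageIndex(off) * PAGE + off % DATA_PER_PAGE;
    }

    public static void main(String[] args) {
        System.out.println(pageIndex(4091));   // last data byte of the first page
        System.out.println(pageIndex(4092));   // first data byte of the second page
        System.out.println(fileOffset(4092));  // lands exactly on the next 4096-byte boundary
    }
}
```

Because each data+CRC pair fills exactly one page, a random read or an append touches whole pages only, matching both the OS page cache and the drive's atomic write unit.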

bq. 2. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed
for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is
a crash, and the entire 3K is not all written out cleanly 

This can be avoided entirely.
A. The OS and hardware can avoid partial page writes.  Filesystems such as ext4 flush a full
page at a time, and hardware these days writes in atomic 4096 byte sectors.
B. The inlined CRC can be laid out so that a single 4096 byte OS page contains all of the
data and its CRC in one atomic chunk, so the CRC and its corresponding data are never
split across pages.
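Point B can be sketched as follows: pack 4092 bytes of data plus its 4-byte CRC32 into one 4096-byte page, and verify on read. This is an illustrative standalone sketch, not HDFS code; the byte layout (CRC big-endian in the last 4 bytes) is an assumption for the example.

```java
import java.util.zip.CRC32;

// Sketch of an inlined-CRC page: 4092 data bytes + 4-byte CRC32 = one 4096-byte
// atomic unit, so a torn write can never separate data from its checksum.
public class InlineCrcChunk {
    static final int PAGE = 4096;
    static final int CRC_SIZE = 4;
    static final int DATA_PER_PAGE = PAGE - CRC_SIZE;   // 4092

    // Pack data plus its CRC into one page-sized chunk (CRC big-endian at the end)
    static byte[] packChunk(byte[] data) {
        if (data.length != DATA_PER_PAGE)
            throw new IllegalArgumentException("expected " + DATA_PER_PAGE + " bytes");
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        long c = crc.getValue();
        byte[] chunk = new byte[PAGE];
        System.arraycopy(data, 0, chunk, 0, data.length);
        chunk[PAGE - 4] = (byte) (c >>> 24);
        chunk[PAGE - 3] = (byte) (c >>> 16);
        chunk[PAGE - 2] = (byte) (c >>> 8);
        chunk[PAGE - 1] = (byte) c;
        return chunk;
    }

    // Recompute the CRC over the data region and compare to the stored value
    static boolean verifyChunk(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, DATA_PER_PAGE);
        long stored = ((chunk[PAGE - 4] & 0xFFL) << 24) | ((chunk[PAGE - 3] & 0xFFL) << 16)
                    | ((chunk[PAGE - 2] & 0xFFL) << 8)  |  (chunk[PAGE - 1] & 0xFFL);
        return crc.getValue() == stored;
    }

    public static void main(String[] args) {
        byte[] data = new byte[DATA_PER_PAGE];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        byte[] chunk = packChunk(data);
        System.out.println(verifyChunk(chunk));   // true: intact page verifies
        chunk[100] ^= 1;                          // flip one data bit
        System.out.println(verifyChunk(chunk));   // false: corruption is detected
    }
}
```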

Under the above conditions, the performance would be excellent, and the data safety would be higher
than in the current situation or with any application-level CRC (unless the application itself inlines
the CRC to prevent splitting the data and CRC across pages).

About the transition to 4096 byte blocks on Hard drives ("Advanced Format" disks):
> Store data and checksums together in block file
> -----------------------------------------------
>                 Key: HDFS-2699
>                 URL: https://issues.apache.org/jira/browse/HDFS-2699
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
> The current implementation of HDFS stores the data in one block file and the metadata(checksum)
in another block file. This means that every read from HDFS actually consumes two disk iops,
one to the datafile and one to the checksum file. This is a major problem for scaling HBase,
because HBase is usually  bottlenecked on the number of random disk iops that the storage-hardware

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

