hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2699) Store data and checksums together in block file
Date Mon, 19 Dec 2011 02:38:30 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172006#comment-13172006 ]

Todd Lipcon commented on HDFS-2699:

bq. Modifying any portion of that region will require that the entire data for the region
be read in, and the CRC recomputed for that entire region and the entire region written out

But the cost of random-reading 4K is essentially the same as the cost of reading 512 bytes.
Once you seek to the offset, the data transfer time is insignificant.

Plus, given the 4KB page size used by Linux, all IO is already at this granularity.
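
To put rough numbers on the seek-versus-transfer argument, here is a back-of-the-envelope sketch. The positioning time and transfer rate are assumed, illustrative values for a spinning disk, not measurements from any particular hardware, and the function name is made up for this example.

```python
# Back-of-the-envelope model of one random read on a spinning disk.
# Both constants are assumed, illustrative values, not measurements.
SEEK_PLUS_ROTATION_S = 0.010      # assumed ~10 ms average positioning time
TRANSFER_BYTES_PER_S = 100e6      # assumed ~100 MB/s sequential transfer rate


def random_read_seconds(nbytes: int) -> float:
    """Positioning time plus transfer time for a single random read."""
    return SEEK_PLUS_ROTATION_S + nbytes / TRANSFER_BYTES_PER_S


t_512 = random_read_seconds(512)
t_4k = random_read_seconds(4096)
overhead = (t_4k - t_512) / t_512
print(f"512 B: {t_512 * 1e3:.3f} ms, 4 KB: {t_4k * 1e3:.3f} ms, extra: {overhead:.2%}")
```

Under these assumptions the 4 KB read costs well under 1% more than the 512-byte read, because the positioning time dominates both.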

bq. An append happens a few days later to extend the file from 9K to 11K. CRC3 is now recomputed
for the 3K-sized region spanning offsets 8K-11K and written out as CRC3-new. But there is
a crash...

This is an existing issue regardless of whether the checksums are interleaved or separate.
The current solution is that we allow a checksum error on the last "checksum chunk" of a file
in the case that it's being recovered after a crash -- iirc only in the case that _all_ replicas
have this issue. If there is any valid replica, then we use that and truncate/rollback the
other files to the sync boundary.
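
The recovery rule described above can be sketched as a toy model. This is not HDFS code: the `Replica` class, the `recover` function, and the returned dictionary are hypothetical names invented for illustration, and the policy is only as accurate as the recollection in the comment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    length: int              # bytes currently on disk
    last_chunk_valid: bool   # does the final checksum chunk verify?


def recover(replicas: List[Replica], sync_boundary: int) -> dict:
    """Toy model of the policy described above; not HDFS code.

    If any replica has a valid last chunk, it becomes the source and the
    failing replicas are rolled back to the sync boundary. Only when every
    replica fails on its last chunk is that checksum error tolerated.
    """
    valid = [r for r in replicas if r.last_chunk_valid]
    if valid:
        return {
            "source": valid[0].name,
            "truncate": {r.name: min(r.length, sync_boundary)
                         for r in replicas if not r.last_chunk_valid},
            "tolerate_last_chunk_error": False,
        }
    return {"source": None, "truncate": {}, "tolerate_last_chunk_error": True}


# Quoted scenario: file synced at 9K, crash while appending up to 11K.
KB = 1024
mixed = recover([Replica("dn1", 11 * KB, True), Replica("dn2", 11 * KB, False)], 9 * KB)
all_bad = recover([Replica("dn1", 11 * KB, False), Replica("dn2", 10 * KB, False)], 9 * KB)
```

In the `mixed` case the replica on `dn1` is used and `dn2` is rolled back to 9K; in the `all_bad` case no replica is truncated and the error on the last chunk is accepted.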

> Store data and checksums together in block file
> -----------------------------------------------
>                 Key: HDFS-2699
>                 URL: https://issues.apache.org/jira/browse/HDFS-2699
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
> The current implementation of HDFS stores the data in one block file and the metadata (checksum)
in another block file. This means that every read from HDFS actually consumes two disk iops,
one to the datafile and one to the checksum file. This is a major problem for scaling HBase,
because HBase is usually bottlenecked on the number of random disk iops that the storage-hardware
