hadoop-hdfs-issues mailing list archives

From "sam rash (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HDFS-1057) Concurrent readers hit ChecksumExceptions if following a writer to very end of file
Date Thu, 15 Apr 2010 04:05:50 GMT

    [ https://issues.apache.org/jira/browse/HDFS-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12857190#action_12857190 ]

sam rash commented on HDFS-1057:


problem refinement/issues:

First, we can't recompute all partial chunk CRCs without a regression: blocks on disk that
are not being written to may have real errors that recomputation would mask.  This means we
have to be more careful and only recompute a checksum that we can guarantee is mismatched,
or is very likely to be wrong.

After trying out various ways of detecting whether blocks changed and recomputing the CRC for
partial chunks, I generalized the problem to this: there is the data length, which is the size
of the data on disk, and the metadata length, which is the amount of data for which we have
metadata (poor name--need a better one).  The problem with concurrent reads/writes is that
these two get out of sync.
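To make the two lengths concrete, here is a hedged sketch of how the "metadata length" relates to the meta file size. The constants assume the common HDFS defaults (512-byte chunks, 4-byte CRC32 per chunk, a 7-byte meta-file header); they are illustrative, not read from the actual code in question.

```java
// Illustrative sketch only; constants assume standard HDFS defaults.
class BlockLengths {
    static final int BYTES_PER_CHECKSUM = 512; // data bytes covered per CRC
    static final int CHECKSUM_SIZE = 4;        // CRC32 is 4 bytes
    static final int META_HEADER_SIZE = 7;     // version + checksum header

    // "Metadata length": an upper bound on the data the meta file has
    // checksums for.  The last CRC may cover only a partial chunk, so the
    // true covered data length can be smaller.
    static long dataCoveredByMeta(long metaFileLength) {
        long numCrcs = (metaFileLength - META_HEADER_SIZE) / CHECKSUM_SIZE;
        return numCrcs * BYTES_PER_CHECKSUM;
    }
}
```

A writer that has flushed data but not yet the matching checksum (or vice versa) is exactly the situation where this bound and the on-disk data length disagree.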

Problems that could occur or have actually occurred:

1. CRC error 1: BlockSender is created when data length > metadata length.  This produces a
CRC error because the CRC stored for the last chunk covers less data than the reader sees
2. CRC error 2: when reading from disk, more metadata is available than blockLength (the data
size is in fact greater than it was when BlockSender was created)
3. EOF error: similar to 1, but in theory an EOF could be encountered when reading checksumIn
(haven't seen this yet)

What we need:

1. guarantee that when we create a BlockSender (or anything that will direct reading of data),
blockLength <= metadata length <= data length
(blockLength being what we consider available to read)
2. detect when the block changes after we get the blockLength (i.e., after creating the
BlockSender), so we know whether to recompute partial chunk CRCs

In this way, if we guarantee #1 and implement #2, we know when the CRC will be invalid and
can recompute it without any regression.
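The invariant in #1 can be written as an explicit check that anything constructing a reader could make. This is illustrative only; the class and method names are hypothetical, not HDFS code.

```java
// Hypothetical sketch: the #1 invariant as an explicit precondition check.
class SenderInvariant {
    static void check(long blockLength, long metadataLength, long dataLength) {
        // blockLength <= metadataLength <= dataLength must hold at creation
        // time, or the reader may trip over unflushed/uncovered bytes.
        if (blockLength > metadataLength || metadataLength > dataLength) {
            throw new IllegalStateException(
                "need blockLength <= metadataLength <= dataLength, got "
                + blockLength + " / " + metadataLength + " / " + dataLength);
        }
    }
}
```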

For #1, we already have the ongoing-creates concept, which keeps some in-memory information
about blocks being written to (new or append).  We add a 'visible length' attribute, which we
expose via FSDatasetInterface.getVisibleLength(Block b).  This is an in-memory length of the
block that we update only after flushing both data and metadata to disk.  For blocks not being
appended to, this function delegates to the original FSDatasetInterface.getLength(Block b),
which is unchanged.  In this way, if BlockSender uses this as the blockLength, #1 above holds.

For #2, we 'memoize' some info about the block when we create the BlockSender.  In particular,
we get the actual length of the block on disk and store it.  On each packet send, we compare
the expected length of our input stream to what is actually there.  If there is more data, the
block has changed and the CRC data for the last partial chunk can't be trusted.
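The detection step can be sketched as follows, again with hypothetical names: memoize the on-disk length at sender creation, and before each packet decide whether the last partial chunk's stored CRC is still trustworthy.

```java
// Illustrative sketch of #2: detect writer activity since sender creation.
class BlockChangeDetector {
    private final long lengthAtCreation; // on-disk length memoized when the
                                         // (hypothetical) sender was created

    BlockChangeDetector(long onDiskLengthAtCreation) {
        this.lengthAtCreation = onDiskLengthAtCreation;
    }

    // If the file on disk is now longer than when we started, a writer has
    // appended in the meantime; the stored CRC for the last partial chunk can
    // no longer be trusted and must be recomputed from the data bytes.
    boolean mustRecomputeLastChunkCrc(long currentOnDiskLength) {
        return currentOnDiskLength > lengthAtCreation;
    }
}
```

This keeps the no-regression property from the problem refinement: for untouched blocks the length never grows, so their on-disk CRCs are always used as-is and real corruption is still caught.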

Test cases: inspired by Todd's (could not apply the patch to 0.20), I create two threads and
have one write very small data sizes and call sync.  The other thread opens the file, reads
to the end, remembers that position, and resumes from there.  This reliably reproduces the
errors within a few seconds, and runs for 10-15s when there is no error (tunable).  To detect
data corruption (which I saw with some particular bugs), I made the written data such that
byte X is X % 127.  This helps catch issues that CRC recomputation might hide.
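The byte-X-is-X%127 pattern lets the reader verify every byte independently of the checksums. A sketch of how a test reader might check a buffer (helper names are invented for illustration):

```java
// Sketch of the corruption-detecting data pattern from the test description.
class PatternData {
    // The byte expected at absolute file offset X.
    static byte expectedByteAt(long offset) {
        return (byte) (offset % 127);
    }

    // Verify a buffer read starting at fileOffset; returns -1 if clean,
    // otherwise the file offset of the first corrupt byte.
    static long firstCorruptOffset(byte[] buf, int len, long fileOffset) {
        for (int i = 0; i < len; i++) {
            if (buf[i] != expectedByteAt(fileOffset + i)) {
                return fileOffset + i;
            }
        }
        return -1;
    }
}
```

Because the expected value is a pure function of the offset, a silently wrong byte is caught even when a (mistakenly) recomputed CRC matches the corrupted data.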

> Concurrent readers hit ChecksumExceptions if following a writer to very end of file
> -----------------------------------------------------------------------------------
>                 Key: HDFS-1057
>                 URL: https://issues.apache.org/jira/browse/HDFS-1057
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: data-node
>    Affects Versions: 0.21.0, 0.22.0
>            Reporter: Todd Lipcon
>            Priority: Critical
> In BlockReceiver.receivePacket, it calls replicaInfo.setBytesOnDisk before calling flush().
> Therefore, if there is a concurrent reader, it's possible to race here - the reader will see
> the new length while those bytes are still in the buffers of BlockReceiver. Thus the client
> will potentially see checksum errors or EOFs. Additionally, the last checksum chunk of the
> file is made accessible to readers even though it is not stable.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

