hadoop-hdfs-dev mailing list archives

From "Wei-Chiu Chuang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-10587) Incorrect offset/length calculation in pipeline recovery causes block corruption
Date Tue, 28 Jun 2016 23:19:10 GMT
Wei-Chiu Chuang created HDFS-10587:
--------------------------------------

             Summary: Incorrect offset/length calculation in pipeline recovery causes block corruption
                 Key: HDFS-10587
                 URL: https://issues.apache.org/jira/browse/HDFS-10587
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: datanode
            Reporter: Wei-Chiu Chuang
            Assignee: Wei-Chiu Chuang


We found that an incorrect offset and length calculation in pipeline recovery may cause block corruption
and result in missing blocks under a very unfortunate scenario.

(1) A client established a pipeline and started writing data to it.
(2) One of the datanodes in the pipeline restarted, closing the socket, and some written data
was left unacknowledged.
(3) The client replaced the failed datanode with a new one, initiating a block transfer to copy
the existing data in the block to the new datanode.
(4) The block was transferred to the new node. Crucially, the entire block, including the unacknowledged
data, was transferred.
(5) The last chunk (512 bytes) was not a full chunk, but the destination still reserved the
whole chunk in its buffer and wrote the entire buffer to disk, so part of what was written
is garbage.
(6) When the transfer was done, the destination datanode converted the replica from temporary
to rbw, which made its visible length equal to the number of bytes on disk. That is to say, it treated
whatever was transferred as acknowledged. However, the visible length of the replica differs
from that on the source of the transfer (it is rounded up to the next multiple of 512).
(7) The client then truncated the block in an attempt to remove the unacknowledged data. However,
because the visible length equals the number of bytes on disk, the truncation did not remove the
unacknowledged data (see the first sketch after this list).
(8) When new data was appended to the destination, it skipped the bytes already on disk. Therefore,
whatever had been written as garbage was never replaced (see the second sketch after this list).
(9) The volume scanner detected the corrupt replica, but due to HDFS-10512, it wouldn't tell
the NameNode to mark the replica as corrupt, so the client continued to form a pipeline using
the corrupt replica.
(10) Finally, the DN that had the only healthy replica was restarted. The NameNode then updated
the pipeline to contain only the corrupt replica.
(11) The client continued to write to the corrupt replica, because neither the client nor the
datanode itself knew the replica was corrupt. When the restarted datanodes came back, their
replicas were stale, even though they were not corrupt. Therefore, none of the replicas was
both good and up to date.
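
To make the offset/length mismatch in steps (5)-(7) concrete, below is a minimal arithmetic sketch.
The 512-byte chunk size matches the default bytesPerChecksum, but the numbers and variable names are
hypothetical and only illustrate the calculation; this is not actual DataNode code.

{code:java}
public class VisibleLengthMismatchSketch {
    // Default checksum chunk size in HDFS (dfs.bytes-per-checksum).
    static final long CHUNK_SIZE = 512;

    public static void main(String[] args) {
        // Hypothetical numbers: 1000 bytes were acknowledged to the client,
        // but 1234 bytes (including unacknowledged data) sat in the source
        // replica and were copied during the block transfer.
        long ackedLength = 1000;
        long transferredLength = 1234;

        // Step (5): the destination reserves a whole chunk for the partial
        // last chunk and writes the entire buffer, so the bytes on disk are
        // rounded up to the next multiple of 512; the tail is garbage.
        long bytesOnDisk = ((transferredLength + CHUNK_SIZE - 1) / CHUNK_SIZE) * CHUNK_SIZE;

        // Step (6): converting the replica from temporary to rbw takes the
        // visible length from the bytes on disk, as if everything on disk
        // had been acknowledged.
        long visibleLength = bytesOnDisk;

        // Step (7): the truncation meant to drop unacknowledged data cuts
        // back to the visible length, which already equals the on-disk
        // length, so nothing is removed.
        long truncateTarget = visibleLength;

        System.out.println("acknowledged length = " + ackedLength);       // 1000
        System.out.println("transferred bytes   = " + transferredLength); // 1234
        System.out.println("bytes on disk       = " + bytesOnDisk);       // 1536
        System.out.println("visible length      = " + visibleLength);     // 1536
        System.out.println("truncate target     = " + truncateTarget);    // 1536 -> no-op
        // The region [1000, 1536) - unacknowledged data plus garbage padding -
        // survives the recovery.
    }
}
{code}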
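
A companion sketch for step (8): once the writer resumes, the bytes already on disk are skipped,
so the garbage tail written with the padded last chunk is never overwritten. Again, the byte-array
model and the numbers are hypothetical; they only illustrate why the garbage survives.

{code:java}
import java.util.Arrays;

public class GarbageSurvivesAppendSketch {
    public static void main(String[] args) {
        final int CHUNK_SIZE = 512;
        int transferred = 1234;                                            // bytes actually copied
        int bytesOnDisk = ((transferred + CHUNK_SIZE - 1) / CHUNK_SIZE)
                * CHUNK_SIZE;                                              // 1536 after padding

        // Model the replica file: real data up to 'transferred', garbage up
        // to 'bytesOnDisk' (written when the whole buffer was flushed).
        byte[] replica = new byte[4096];
        Arrays.fill(replica, 0, transferred, (byte) 'D');
        Arrays.fill(replica, transferred, bytesOnDisk, (byte) '?');

        // Step (8): the resumed write skips everything already on disk and
        // appends after it, so the garbage region is never replaced.
        byte[] newData = new byte[1000];
        Arrays.fill(newData, (byte) 'N');
        System.arraycopy(newData, 0, replica, bytesOnDisk, newData.length);

        boolean garbageSurvives =
                replica[transferred] == '?' && replica[bytesOnDisk - 1] == '?';
        System.out.println("garbage still present in [" + transferred + ", "
                + bytesOnDisk + "): " + garbageSurvives);                  // true
    }
}
{code}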

The sequence of events was reconstructed from the DataNode/NameNode logs and my understanding
of the code.
Incidentally, we have observed the same sequence of events on two independent clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

