hadoop-hdfs-issues mailing list archives

From "Nathan Roberts (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11755) Underconstruction blocks can be considered missing
Date Wed, 10 May 2017 18:08:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16005127#comment-16005127 ]

Nathan Roberts commented on HDFS-11755:
---------------------------------------

The failing unit tests in trunk have been unstable in precommit:
org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testMultipleVolFailuresOnNode
org.apache.hadoop.hdfs.TestDFSRSDefault10x4StripedOutputStreamWithFailure.testMultipleDatanodeFailure56
The timed-out test, TestLeaseRecovery2, does not fail locally and has also been unstable
across multiple precommit runs on this JIRA.


> Underconstruction blocks can be considered missing
> --------------------------------------------------
>
>                 Key: HDFS-11755
>                 URL: https://issues.apache.org/jira/browse/HDFS-11755
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha2, 2.8.1
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>         Attachments: HDFS-11755.001.patch, HDFS-11755.002.patch, HDFS-11755-branch-2.002.patch,
>                      HDFS-11755-branch-2.8.002.patch
>
>
> The following sequence of events can lead to an under-construction block being considered missing.
> - Pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk, so some updates take a long time
> - Client writes the entire block and is waiting for the final ack
> - DN1, DN2 and DN3 have all received the block
> - DN1 is waiting for an ACK from DN2, which is waiting for an ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It does eventually
>   succeed, but it is VERY slow at doing so.
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so DN1 notices
>   and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - The drive containing the block on DN3 fails enough that the DN takes it offline and notifies
>   the NN of the failed volume
> - NN removes DN3's replica from the triplets and then declares the block missing because
>   there are no other replicas
> It seems like we shouldn't consider uncompleted blocks for replication.
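
The sequence above can be replayed with a small toy model. This is only an illustrative sketch, not HDFS code: `Block`, `BlockState`, `is_missing_naive`, `is_missing_guarded` and `replay_scenario` are hypothetical names made up for this example, and the "guarded" check is just one reading of the suggestion that uncompleted blocks should not be considered.

```python
from dataclasses import dataclass, field
from enum import Enum


class BlockState(Enum):
    # Hypothetical two-state model; the real NameNode tracks more states.
    UNDER_CONSTRUCTION = "under_construction"
    COMPLETE = "complete"


@dataclass
class Block:
    state: BlockState = BlockState.UNDER_CONSTRUCTION
    # DataNodes that have reported a replica of this block to the NN.
    replicas: set = field(default_factory=set)


def is_missing_naive(block):
    # Behaviour described in the report: zero known replicas means "missing",
    # regardless of whether the block was ever completed.
    return len(block.replicas) == 0


def is_missing_guarded(block):
    # Assumed alternative: only completed blocks can be declared missing.
    return block.state is BlockState.COMPLETE and len(block.replicas) == 0


def replay_scenario():
    block = Block()
    # DN1 and DN2 tear down the pipeline without finalizing, so they never
    # report a replica. The client never gets the final ack, so the block
    # stays under construction.
    # DN3 eventually finalizes and sends an IBR to the NN.
    block.replicas.add("DN3")
    # DN3's volume fails; the NN removes DN3's replica.
    block.replicas.discard("DN3")
    return block


if __name__ == "__main__":
    b = replay_scenario()
    print("naive check says missing:  ", is_missing_naive(b))
    print("guarded check says missing:", is_missing_guarded(b))
```

In this model the naive check flags the block as missing once DN3's replica is removed, while the guarded check does not, because the block never left the under-construction state.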



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

