hadoop-hdfs-issues mailing list archives

From "Nathan Roberts (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11755) Underconstruction blocks can be considered missing
Date Tue, 09 May 2017 20:19:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003459#comment-16003459 ]

Nathan Roberts commented on HDFS-11755:
---------------------------------------

bq. Do you know which one makes more sense?
I'm not an expert in this area, but here's my understanding. When a block is completed and the
client has received the necessary acks, the client either adds another block or completes
the file. Either action causes the namenode to consider the block complete, and from that point
the namenode maintains replication of the completed block. If the pipeline fails while writing,
the client may (depending on the configured policy) rebuild the pipeline to maintain the desired
level of replication in the pipeline. So, while a block is mutating, it is the client that
is ultimately responsible for making sure enough datanodes remain in the pipeline and in sync
with the data. Once a block is complete, it becomes the namenode's responsibility to maintain
replication.
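
To make the client-side half of that concrete, here is a minimal Java sketch of a client dropping a failed datanode from its write pipeline and, if the policy allows, pulling in a replacement. It is a toy model, not HDFS client code: WritePipeline, onFailure and the boolean replaceOnFailure flag are hypothetical names invented for illustration, and the real policy is richer than a single on/off switch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of client-side pipeline maintenance; not actual HDFS client code.
// WritePipeline and its members are hypothetical names for illustration.
final class WritePipeline {
    private final List<String> nodes;
    private final int desiredWidth;
    private final boolean replaceOnFailure; // stands in for the configured policy

    WritePipeline(List<String> initial, boolean replaceOnFailure) {
        this.nodes = new ArrayList<>(initial);
        this.desiredWidth = initial.size();
        this.replaceOnFailure = replaceOnFailure;
    }

    // Drop the failed datanode; if the policy allows and a spare exists,
    // add the spare so the pipeline returns to its desired width.
    void onFailure(String failed, List<String> spares) {
        nodes.remove(failed);
        if (replaceOnFailure && nodes.size() < desiredWidth && !spares.isEmpty()) {
            nodes.add(spares.remove(0));
        }
    }

    // Defensive copy of the current pipeline members.
    List<String> nodes() {
        return new ArrayList<>(nodes);
    }

    public static void main(String[] args) {
        List<String> spares = new ArrayList<>(Arrays.asList("DN4"));
        WritePipeline p = new WritePipeline(Arrays.asList("DN1", "DN2", "DN3"), true);
        p.onFailure("DN3", spares);
        System.out.println(p.nodes());
    }
}
```

With replaceOnFailure set to false, the same failure simply leaves a two-node pipeline, which is the case where the client carries on with reduced replication until the block is completed.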

If a client dies before completing the last block, then after a timeout lease recovery will
close the file and, if possible, synchronize and commit its blocks.

There is also hsync(), which applications can use to enhance the durability guarantees at
the datanode (via fsync).

Hope that helps a little.


> Underconstruction blocks can be considered missing
> --------------------------------------------------
>
>                 Key: HDFS-11755
>                 URL: https://issues.apache.org/jira/browse/HDFS-11755
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.0.0-alpha2, 2.8.1
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>         Attachments: HDFS-11755.001.patch
>
>
> The following sequence of events can lead to an under-construction block being considered missing.
> - Pipeline of 3 DNs, DN1->DN2->DN3
> - DN3 has a failing disk, so some updates take a long time
> - Client writes the entire block and is waiting for the final ack
> - DN1, DN2 and DN3 have all received the block
> - DN1 is waiting for an ACK from DN2, which is waiting for an ACK from DN3
> - DN3 is having trouble finalizing the block due to the failing drive. It does eventually succeed, but it is VERY slow at doing so.
> - DN2 times out waiting for DN3 and tears down its pieces of the pipeline, so DN1 notices and does the same. Neither DN1 nor DN2 finalized the block.
> - DN3 finally sends an IBR to the NN indicating the block has been received.
> - The drive containing the block on DN3 fails badly enough that the DN takes it offline and notifies the NN of the failed volume
> - NN removes DN3's replica from the triplets and then declares the block missing because there are no other replicas
> Seems like we shouldn't consider uncompleted blocks for replication.
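
The sequence above can be replayed against a toy model of the namenode's bookkeeping for a single block. This is a sketch under assumptions, not namenode code and not necessarily what HDFS-11755.001.patch does: TrackedBlock and its methods are hypothetical names, and missingIfCompleteOnly is just one possible reading of "don't consider uncompleted blocks".

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the namenode's view of one block; not actual HDFS code.
// TrackedBlock and its methods are hypothetical names for illustration.
final class TrackedBlock {
    private final Set<String> replicas = new HashSet<>();
    private boolean complete = false;

    // A datanode reports (via IBR) that it has received the block.
    void reportReceived(String datanode) { replicas.add(datanode); }

    // A replica is dropped, e.g. because its volume was taken offline.
    void removeReplica(String datanode) { replicas.remove(datanode); }

    // The client added another block or completed the file.
    void markComplete() { complete = true; }

    // Behavior described in the report: no replicas left means "missing",
    // whether or not the block was ever completed.
    boolean missingNaive() { return replicas.isEmpty(); }

    // Suggested behavior: only completed blocks can be declared missing.
    boolean missingIfCompleteOnly() { return complete && replicas.isEmpty(); }

    public static void main(String[] args) {
        TrackedBlock block = new TrackedBlock();
        block.reportReceived("DN3"); // only DN3 finalized and reported the block
        block.removeReplica("DN3");  // DN3's failing volume is taken offline
        System.out.println("naive check says missing: " + block.missingNaive());
        System.out.println("complete-only check says missing: " + block.missingIfCompleteOnly());
    }
}
```

In this model the block was never marked complete, so the complete-only check leaves it to the client (or lease recovery) rather than flagging it as missing.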



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


