hadoop-hdfs-issues mailing list archives

From "Jing Zhao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6178) Decommission on standby NN couldn't finish
Date Tue, 01 Apr 2014 21:25:19 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13957028#comment-13957028
] 

Jing Zhao commented on HDFS-6178:
---------------------------------

I guess we can let people only run refreshNodes on the ANN. In that case, we may have the following:
# If the decommission finishes before any NN failover, the scenario in the description can happen, i.e., the SBN may have made inconsistent decisions and keeps retrying the decommission. However, its commands will be ignored by DNs. And when failover happens, since the original SBN will clear all the replication queues and cached DN commands, this NN will eventually regenerate the replication queues based on the correct information.
# If NN failover happens during the decommission (the replication for the decommission is still ongoing), the original SBN will still clear all the replication queues and re-initialize them based on incoming block reports. Then if we run refreshNodes on this NN, it can reach a correct decision.
# We may want to disable the replication monitor on the SBN so that it will not try to send replication/invalidate commands to DNs.
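Point 3 above could be sketched roughly as follows. This is a minimal, self-contained simulation of the idea, not the actual BlockManager code; the names (`HAState`, `computeDatanodeWork`) only mirror HDFS concepts and are hypothetical here:

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicationMonitorSketch {
    // Hypothetical stand-in for the NameNode's HA state.
    enum HAState { ACTIVE, STANDBY }

    // Gate replication work on HA state: a standby NN never queues
    // replicate/invalidate commands, since DNs ignore them anyway.
    static List<String> computeDatanodeWork(HAState state, List<String> underReplicated) {
        List<String> commands = new ArrayList<>();
        if (state == HAState.STANDBY) {
            return commands; // no commands issued while in standby
        }
        for (String block : underReplicated) {
            commands.add("replicate " + block);
        }
        return commands;
    }
}
```

The upside is that the SBN would no longer accumulate pending-replication entries that can only ever time out.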

> Decommission on standby NN couldn't finish
> ------------------------------------------
>
>                 Key: HDFS-6178
>                 URL: https://issues.apache.org/jira/browse/HDFS-6178
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>            Reporter: Ming Ma
>
> Currently, decommissioning machines in an HA-enabled cluster requires running refreshNodes on both the active and standby NNs. Sometimes decommissioning won't finish from the standby NN's point of view. Here is the diagnosis of why that can happen.
> The standby NN's blockManager manages block replication and block invalidation as if it were the active NN, even though DNs ignore block commands coming from the standby NN. When the standby NN makes block operation decisions, such as the target of a block replication or the node to remove excess blocks from, the decision is independent of the active NN. So the active NN and standby NN can end up with different states. When we try to decommission nodes via the standby NN, such state inconsistency might prevent the standby NN from making progress. Here is an example.
> Machines involved: A, B, C, D, E, F, G, H.
> 1. For a given block, both the active and standby NNs have 5 replicas, on machines A, B, C, D, E. So both the active and standby decide to pick excess nodes to invalidate.
> The active picked D and E as excess DNs. After the next block reports from D and E, the active NN has 3 active replicas (A, B, C) and 0 excess replicas.
> {noformat}
> 2014-03-27 01:50:14,410 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010,
blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:50:15,539 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (D:50010,
blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
> The standby picked C and E as excess DNs. Since DNs ignore commands from the standby, after the next block reports from C, D, and E, the standby has 2 active replicas (A, B) and 1 excess replica (C).
> {noformat}
> 2014-03-27 01:51:49,543 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (E:50010,
blk_-5207804474559026159_121186764) is added to invalidated blocks set
> 2014-03-27 01:51:49,894 INFO BlockStateChange: BLOCK* chooseExcessReplicates: (C:50010,
blk_-5207804474559026159_121186764) is added to invalidated blocks set
> {noformat}
> 2. Machine A's decommission request was sent to the standby. The standby had only one live replica and picked machines G and H as replication targets, but since the standby's commands were ignored by DNs, G and H remained in the pending replication queue until they timed out. At this point, there is one decommissioning replica (A), 1 active replica (B), and one excess replica (C).
> {noformat}
> 2014-03-27 04:42:52,258 INFO BlockStateChange: BLOCK* ask A:50010 to replicate blk_-5207804474559026159_121186764
to datanode(s) G:50010 H:50010
> {noformat}
> 3. Machine A's decommission request was sent to the active NN. The active NN picked machine F as the target, and the replication finished properly. So the active NN had 3 active replicas (B, C, F) and one decommissioned replica (A).
> {noformat}
> 2014-03-27 04:44:15,239 INFO BlockStateChange: BLOCK* ask 10.42.246.110:50010 to replicate
blk_-5207804474559026159_121186764 to datanode(s) F:50010
> 2014-03-27 04:44:16,083 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated:
F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> {noformat}
> 4. The standby NN picked up F as a new replica. Thus the standby had one decommissioning replica (A), 2 active replicas (B, F), and one excess replica (C). The standby NN kept trying to schedule replication work, but DNs ignored its commands.
> {noformat}
> 2014-03-27 04:44:16,084 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated:
F:50010 is added to blk_-5207804474559026159_121186764 size 7100065
> 2014-03-28 23:06:11,970 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
Block: blk_-5207804474559026159_121186764, Expected Replicas: 3, live replicas: 2, corrupt
replicas: 0, decommissioned replicas: 1, excess replicas: 1, Is Open File: false, Datanodes
having this block: C:50010 B:50010 A:50010 F:50010 , Current Datanode: A:50010, Is current
datanode decommissioning: true
> {noformat}
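The divergence in step 4 can be reduced to a replica-counting exercise. The sketch below simulates how each NN counts live replicas from its own view; the names are hypothetical and this is only an illustration of the state split, not real HDFS code:

```java
import java.util.Set;

public class ExcessDivergenceSketch {
    // Count live replicas from one NN's point of view: a reported replica
    // counts as live only if this NN has not marked it excess and it is
    // not the node being decommissioned.
    static int liveReplicas(Set<String> reported, Set<String> excess,
                            String decommissioning) {
        int live = 0;
        for (String dn : reported) {
            if (!excess.contains(dn) && !dn.equals(decommissioning)) {
                live++;
            }
        }
        return live;
    }
}
```

Both NNs receive the same block reports (A, B, C, F after step 3), but the active marked D and E excess while the standby marked C and E. With A decommissioning, the active counts 3 live replicas (B, C, F) and finishes, while the standby counts only 2 (B, F), so it keeps scheduling replication work that DNs ignore.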



--
This message was sent by Atlassian JIRA
(v6.2#6252)
