hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-7128) Decommission slows way down when it gets towards the end
Date Tue, 23 Sep 2014 15:01:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144865#comment-14144865
] 

Kihwal Lee edited comment on HDFS-7128 at 9/23/14 2:59 PM:
-----------------------------------------------------------

This is not just about decommissioning.  If nodes die and a large number of blocks need to
be replicated, the replication monitor can schedule a large number of blocks in one run and
it can over-schedule far beyond the hard limit on certain nodes, since {{getNumberOfBlocksToBeReplicated()}}
is not updated.  As you pointed out, gross over-scheduling should be avoided because it causes
replication timeouts and potentially duplicate replications and invalidations.  In my experience,
multiple node deaths are commonly caused by DNS or network outages.  When such an outage causes
a big cluster to lose a large proportion of its nodes, recovery can be very slow because almost
every node is over-scheduled with replication work that is no longer necessary.  This patch
will also help in that case.

I think the proposed approach is reasonable. If I were to change one thing, I would call {{decrementPendingReplicationWithoutTargets()}}
in a finally block surrounding {{chooseTarget()}}.  Do you think the default soft-limit and
hard-limit are reasonable?
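For illustration, here is a minimal sketch of that suggestion (not the actual patch; everything other
than {{decrementPendingReplicationWithoutTargets()}} and {{chooseTarget()}} is an assumed placeholder):
{noformat}
// Assumed counterpart that bumps a provisional count when the source node is picked,
// so getNumberOfBlocksToBeReplicated() reflects work that is about to be scheduled.
rw.srcNode.incrementPendingReplicationWithoutTargets();
try {
  // Target selection for this block; may find no targets or throw.
  chooseTarget(rw);
} finally {
  // Always release the provisional count, so a failure inside chooseTarget()
  // cannot leave the node permanently over-counted.
  rw.srcNode.decrementPendingReplicationWithoutTargets();
}
{noformat}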



> Decommission slows way down when it gets towards the end
> --------------------------------------------------------
>
>                 Key: HDFS-7128
>                 URL: https://issues.apache.org/jira/browse/HDFS-7128
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Ming Ma
>            Assignee: Ming Ma
>         Attachments: HDFS-7128.patch
>
>
> When we decommission nodes across different racks, the decommission process becomes really
slow at the end, hardly making any progress. The problem is that some blocks are on 3 decomm-in-progress
DNs and the way replications are scheduled causes unnecessary delay. Here is the analysis.
> When BlockManager schedules the replication work from neededReplication, it first needs
to pick the source node for replication via chooseSourceDatanode. The core policies to pick
the source node are:
> 1. Prefer decomm-in-progress nodes.
> 2. Only pick nodes whose outstanding replication counts are below the thresholds dfs.namenode.replication.max-streams
or dfs.namenode.replication.max-streams-hard-limit, depending on the replication priority (a hedged
sketch of this check follows).
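> A hedged sketch of how that per-node check might look inside chooseSourceDatanode (illustrative only;
identifiers other than node.getNumberOfBlocksToBeReplicated() are assumptions about the actual code):
> {noformat}
> // Soft limit (dfs.namenode.replication.max-streams): for normal-priority blocks,
> // skip a candidate source node that already has enough replication work scheduled.
> if (node.getNumberOfBlocksToBeReplicated() >= maxReplicationStreams
>     && priority != UnderReplicatedBlocks.QUEUE_HIGHEST_PRIORITY) {
>   continue;
> }
> // Hard limit (dfs.namenode.replication.max-streams-hard-limit): skip the node
> // even for highest-priority blocks.
> if (node.getNumberOfBlocksToBeReplicated() >= replicationStreamsHardLimit) {
>   continue;
> }
> {noformat}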
> When we decommission nodes,
> 1. All the decommissioning nodes' blocks will be added to neededReplication.
> 2. BM will pick X blocks from neededReplication in each iteration. X is based on the cluster
size and a configurable multiplier. So if the cluster has 2000 nodes, X will be around 4000.
> 3. Given these 4000 blocks are on the same decomm-in-progress node A, A ends up being chosen
as the source node for all of these 4000 blocks. The reason the outstanding replication thresholds
don't kick in is the implementation of BlockManager.computeReplicationWorkForBlocks: node.getNumberOfBlocksToBeReplicated()
remains zero because node.addBlockToBeReplicated is called only after the source node iteration.
> {noformat}
> ...
>       synchronized (neededReplications) {
>         for (int priority = 0; priority < blocksToReplicate.size(); priority++) {
>           ...
>           // chooseSourceDatanode() is called here, while the chosen node's
>           // getNumberOfBlocksToBeReplicated() has not been incremented yet,
>           // so the same node keeps passing the max-streams checks.
>           ...
>         }
>       }
>       ...
>       for (ReplicationWork rw : work) {
>         ...
>         // The counter is only bumped here, after every source node has already been chosen.
>         rw.srcNode.addBlockToBeReplicated(block, targets);
>         ...
>       }
> {noformat}
>  
> 4. So several decomm-in-progress nodes A, B, C end up with node.getNumberOfBlocksToBeReplicated() around 4000.
> 5. If we assume each node can replicate 5 blocks per minute, it is going to take 800 minutes
to finish replicating these blocks (see the rough calculation after this list).
> 6. The pending replication timeout kicks in after 5 minutes. The items will be removed from
the pending replication queue and added back to neededReplication. The replications will then
be handled by other source nodes of these blocks. But the blocks still remain in A, B and C's
per-node replication queues, DatanodeDescriptor.replicateBlocks, so A, B and C continue replicating
these blocks, even though they might already have been replicated by other DNs after the replication
timeout.
> 7. Some block's replicas exist only on A, B and C, and the block sits at the end of A's replication
queue. Even after the block's replication times out, no source node can be chosen because A, B and
C all have high pending replication counts. So we have to wait until A drains its replication queue.
Meanwhile, the items in A's replication queue have already been taken care of by other nodes and are
no longer under-replicated.
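> For concreteness, a rough calculation behind steps 2 and 5 (a sketch only: the work-multiplier default
of 2 is an assumption about dfs.namenode.replication.work.multiplier.per.iteration, and 5 blocks/minute
is the rate already assumed in step 5):
> {noformat}
> // Blocks scheduled per ReplicationMonitor iteration scale with the live-node count.
> int liveNodes = 2000;                        // cluster size from step 2
> int workMultiplier = 2;                      // assumed default multiplier per iteration
> int blocksToProcess = liveNodes * workMultiplier;              // = 4000 blocks
> int blocksPerNodePerMinute = 5;              // replication rate assumed in step 5
> int minutesToDrain = blocksToProcess / blocksPerNodePerMinute; // = 800 minutes, over 13 hours
> {noformat}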



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
