hadoop-hdfs-issues mailing list archives

From "Stephen O'Donnell (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-13157) Do Not Remove Blocks Sequentially During Decommission
Date Tue, 03 Sep 2019 10:59:01 GMT

[ https://issues.apache.org/jira/browse/HDFS-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16921331#comment-16921331 ]

Stephen O'Donnell edited comment on HDFS-13157 at 9/3/19 10:58 AM:
-------------------------------------------------------------------

{quote}

Add a configuration, which makes NN to release the lock every 10000(configurable) blocks.

{quote}

There was some discussion related to this in HDFS-10477, and the decision there was to drop the lock
after processing each storage. The reason is that the iterator for the storage could hit
a ConcurrentModificationException if its contents change while the lock is dropped and retaken.
Locking at the storage level is probably a good middle ground between how it works currently
and locking on a block count threshold.
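
As a rough sketch of that per-storage locking pattern (illustrative only; the lock and storage types below are stand-ins, not the real NameNode classes):

{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class PerStorageLockSketch {
  // Stand-ins for NameNode internals; these names are hypothetical.
  interface Storage { Iterator<Long> blockIterator(); }
  interface BlockCheck { void check(long blockId); }

  private final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock();

  void scanDecommissioningNode(List<Storage> storages, BlockCheck check) {
    for (Storage storage : storages) {
      namesystemLock.writeLock().lock();
      try {
        // The storage's iterator is created and fully consumed under a single
        // lock hold, so changes made while the lock is dropped cannot trigger
        // a ConcurrentModificationException on it.
        for (Iterator<Long> it = storage.blockIterator(); it.hasNext(); ) {
          check.check(it.next());
        }
      } finally {
        namesystemLock.writeLock().unlock();
      }
      // Lock is released between storages, letting other NameNode work
      // proceed before the next storage is scanned.
    }
  }
}
{code}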

Thinking about the problem of replicating older blocks first ... We currently have several
replication queues, and blocks with only 1 replica should go into the highest priority queue.
That means other blocks (with only 2 replicas) and decommissioning blocks are in the 'normal' queue.
Looking at how that queue is currently processed, it begins at the start and:
 # Gets 2 * live_nodes blocks
 # Attempts to schedule them for replication based on max-streams limits
 # Any that are not scheduled are simply dropped until all other blocks have been tried and
the iterator cycles round.

Therefore even in the current implementation some of the blocks can get left behind for some
time.
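
As a rough illustration of that scan (not the actual BlockManager code; the queue shape and the max-streams check are assumptions for the example):

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.LongPredicate;

class ReplicationScanSketch {
  // One pass over the 'normal' priority queue: consider 2 * liveNodes blocks,
  // schedule those that fit under the max-streams limit, and push the rest to
  // the back so they are only retried after the iterator has cycled round.
  static List<Long> scheduleOnePass(Queue<Long> normalQueue, int liveNodes,
      LongPredicate fitsUnderMaxStreams) {
    List<Long> scheduled = new ArrayList<>();
    int toConsider = 2 * liveNodes;
    for (int i = 0; i < toConsider && !normalQueue.isEmpty(); i++) {
      long blockId = normalQueue.poll();
      if (fitsUnderMaxStreams.test(blockId)) {
        scheduled.add(blockId);
      } else {
        normalQueue.add(blockId);  // dropped for now, retried next cycle
      }
    }
    return scheduled;
  }

  public static void main(String[] args) {
    Queue<Long> queue = new ArrayDeque<>(List.of(1L, 2L, 3L, 4L, 5L, 6L));
    // Pretend only even block ids currently have a free replication stream.
    System.out.println(scheduleOnePass(queue, 2, id -> id % 2 == 0));
  }
}
{code}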

This does seem to be a tricky problem to get correct, as there are quite a few edge cases
and scenarios to consider.



> Do Not Remove Blocks Sequentially During Decommission 
> ------------------------------------------------------
>
>                 Key: HDFS-13157
>                 URL: https://issues.apache.org/jira/browse/HDFS-13157
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>    Affects Versions: 3.0.0
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>         Attachments: HDFS-13157.1.patch
>
>
> From what I understand of [DataNode decommissioning|https://github.com/apache/hadoop/blob/42a1c98597e6dba2e371510a6b2b6b1fb94e4090/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java] it
appears that all the blocks are scheduled for removal _in order_. I'm not 100% sure what
the ordering is exactly, but I think it loops through each data volume and schedules each
block to be replicated elsewhere. The net effect is that during a decommission, all of the
DataNode transfer threads slam a single volume until it is cleaned out, at which point
they all slam the next volume, and so on.
> Please randomize the block list so that there is a more even distribution across all
volumes when decommissioning a node.
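
A minimal sketch of the kind of randomization being asked for (hypothetical; the per-volume block lists here are an assumption, not the DatanodeAdminManager API):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RandomizedDecommissionSketch {
  // Instead of scheduling volume 1's blocks, then volume 2's, and so on,
  // flatten all volumes' block lists and shuffle, so replication work is
  // spread across volumes while the node is decommissioning.
  static List<Long> blocksInRandomOrder(List<List<Long>> blocksPerVolume) {
    List<Long> all = new ArrayList<>();
    for (List<Long> volumeBlocks : blocksPerVolume) {
      all.addAll(volumeBlocks);
    }
    Collections.shuffle(all);
    return all;
  }
}
{code}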




