hadoop-hdfs-issues mailing list archives

From "Stephen O'Donnell (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-14861) Reset LowRedundancyBlocks Iterator periodically
Date Fri, 20 Sep 2019 17:36:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-14861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934602#comment-16934602 ]

Stephen O'Donnell commented on HDFS-14861:

The simplest way to fix this problem may be to increment a counter each time chooseLowRedundancyBlocks()
is called, and reset it each time the iterators are reset. If they have not been reset for
X calls, force a reset:
{code:java}
synchronized List<List<BlockInfo>> chooseLowRedundancyBlocks(
    int blocksToProcess) {
  final List<List<BlockInfo>> blocksToReconstruct = new ArrayList<>(LEVEL);

  int count = 0;
  int priority = 0;
  for (; count < blocksToProcess && priority < LEVEL; priority++) {
    if (priority == QUEUE_WITH_CORRUPT_BLOCKS) {
      // do not choose corrupted blocks.
      continue;
    }

    // Go through all blocks that need reconstructions with current priority.
    // Set the iterator to the first unprocessed block at this priority level
    final Iterator<BlockInfo> i = priorityQueues.get(priority).getBookmark();
    final List<BlockInfo> blocks = new LinkedList<>();
    blocksToReconstruct.add(blocks);
    // Loop through all remaining blocks in the list.
    for (; count < blocksToProcess && i.hasNext(); count++) {
      blocks.add(i.next());
    }
  }

  callCount++;  // >>>> New counter
  if (priority == LEVEL || callCount > threshold) { // >>>> Check counter
                                                    // against some threshold
    callCount = 0;
    // Reset all bookmarks because there were no recently added blocks.
    for (LightWeightLinkedSet<BlockInfo> q : priorityQueues) {
      q.resetBookmark();
    }
  }

  return blocksToReconstruct;
} {code}
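As a self-contained illustration of the bookmark behaviour that chooseLowRedundancyBlocks() relies on, here is a minimal stand-in (the class and method names below are hypothetical, not the real LightWeightLinkedSet API): a block that fails to schedule sits behind the bookmark and is never seen again until the bookmark resets.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a queue with a persistent bookmark. take(n) always
// moves forward; a skipped block is only revisited after resetBookmark().
class BookmarkedQueue {
  private final List<String> blocks = new ArrayList<>();
  private int bookmark = 0;  // index of the first unprocessed block

  void add(String block) { blocks.add(block); }

  // Return up to n blocks starting at the bookmark, advancing it.
  List<String> take(int n) {
    List<String> out = new ArrayList<>();
    while (out.size() < n && bookmark < blocks.size()) {
      out.add(blocks.get(bookmark++));
    }
    return out;
  }

  void resetBookmark() { bookmark = 0; }

  public static void main(String[] args) {
    BookmarkedQueue q = new BookmarkedQueue();
    for (int i = 0; i < 6; i++) {
      q.add("blk_" + i);
    }
    // Suppose blk_1 from the first batch could not be scheduled
    // (e.g. all target datanodes were at max-streams).
    System.out.println(q.take(3));  // [blk_0, blk_1, blk_2]
    // Later calls never revisit blk_1 ...
    System.out.println(q.take(3));  // [blk_3, blk_4, blk_5]
    // ... until the bookmark is reset, after which it is retried.
    q.resetBookmark();
    System.out.println(q.take(3));  // [blk_0, blk_1, blk_2]
  }
}
```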
If things are working well, then most or all blocks returned by this method should be scheduled
on datanodes, and hence the iterator bookmark should be close to the head of the list. Resetting
it would only cause a few blocks to be retried.

If things are not working well, then resetting the iterator back to the head of the list would
cause a lot of blocks to be retried and hence it would take longer to reach the tail of the
list. However that would probably indicate there are problems on the cluster (eg unable to
place new replicas, or out of service replicas).

Provided the time between resets is not too small (eg 30 - 60 minutes) this would probably
be OK.
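Since chooseLowRedundancyBlocks() is called once per redundancy monitor iteration (every 3 seconds by default), a desired wall-clock reset interval maps to a call-count threshold roughly as below. This is a sketch of the arithmetic only; how the threshold would actually be configured is left open.

```java
// Sketch: convert a desired reset interval into a call-count threshold,
// assuming one chooseLowRedundancyBlocks() call per monitor iteration.
class ResetThreshold {
  static int callThreshold(int resetMinutes, int monitorSeconds) {
    return resetMinutes * 60 / monitorSeconds;
  }

  public static void main(String[] args) {
    // A 60 minute reset interval at the default 3 second monitor period:
    System.out.println(callThreshold(60, 3)); // prints 1200
  }
}
```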

If blocks are under-replicated (eg from a node failure), skipped blocks are not a problem - all blocks have to be processed eventually anyway, so it does not really matter in what order they are processed, or which ones are temporarily skipped.

However with decommissioning and maintenance mode, a skipped block can prevent the node from
completing the process. Consider decommissioning three nodes, one of which has relatively few
blocks. A skipped block on the smaller node would cause it to wait, with only a few blocks
pending, until the other two nodes are fully processed and the iterator is reset.
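The wait can be quantified under the formula from the issue description (live nodes * work multiplier blocks per call). The cluster sizes below are hypothetical, chosen only to show the scale of the delay:

```java
// Sketch: how long until a skipped block is retried, if the bookmark only
// resets once the tail of the queue is reached. Uses the per-call batch
// size formula from the issue description (live nodes * work multiplier).
class RetryDelay {
  static long secondsUntilRetry(long blocksAhead, int liveNodes,
      int workMultiplier, int monitorSeconds) {
    long blocksPerCall = (long) liveNodes * workMultiplier;
    long calls = (blocksAhead + blocksPerCall - 1) / blocksPerCall; // ceil
    return calls * monitorSeconds;
  }

  public static void main(String[] args) {
    // 1,000,000 blocks still queued behind the skipped block, 100 live
    // nodes, default multiplier 2, default 3 second monitor interval:
    long seconds = secondsUntilRetry(1_000_000, 100, 2, 3);
    System.out.println(seconds / 3600.0 + " hours"); // roughly 4.2 hours
  }
}
```

Even in this best case (every retrieved block scheduled immediately), the skipped block waits hours; if scheduling stalls, the wait grows accordingly.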


> Reset LowRedundancyBlocks Iterator periodically
> -----------------------------------------------
>                 Key: HDFS-14861
>                 URL: https://issues.apache.org/jira/browse/HDFS-14861
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 3.3.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
> When the namenode needs to schedule blocks for reconstruction, the blocks are placed
into the neededReconstruction object in the BlockManager. This is an instance of LowRedundancyBlocks,
which maintains a list of priority queues where the blocks are held until they are scheduled
for reconstruction / replication.
> Every 3 seconds, by default, a number of blocks are retrieved from LowRedundancyBlocks.
The method LowRedundancyBlocks.chooseLowRedundancyBlocks() is used to retrieve the next set
of blocks using a bookmarked iterator. Each call to this method moves the iterator forward.
The number of blocks retrieved is governed by the formula:
> number_of_live_nodes * dfs.namenode.replication.work.multiplier.per.iteration (default 2).
> Then the namenode attempts to schedule those blocks on datanodes, but each datanode has
a limit of how many blocks can be queued against it (controlled by dfs.namenode.replication.max-streams)
so all of the retrieved blocks may not be scheduled. There may be other block availability
reasons the blocks are not scheduled too.
> As the iterator in chooseLowRedundancyBlocks() always moves forward, the blocks which
were not scheduled are not retried until the end of the queue is reached and the iterator
is reset.
> If the replication queue is very large (eg several nodes are being decommissioned) or
if blocks are being continuously added to the replication queue (eg nodes decommission using
the proposal in HDFS-14854) it may take a very long time for the iterator to be reset to the head of the list.
> The result of this could be a few blocks for a decommissioning or entering-maintenance
node getting left behind, and it taking many hours or even days for them to be retried,
which could stop decommission completing.
> With this Jira, I would like to suggest we reset the iterator after a configurable number
of calls to chooseLowRedundancyBlocks() so any left behind blocks are retried.

This message was sent by Atlassian Jira
