hadoop-hdfs-issues mailing list archives

From "He Xiaoqiao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12703) Exceptions are fatal to decommissioning monitor
Date Sun, 07 Jul 2019 18:10:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879917#comment-16879917 ]

He Xiaoqiao commented on HDFS-12703:

Uploaded patch [^HDFS-12703.005.patch] with a unit test to fix this issue.
After digging into the decommission logic, I think the root cause is that the DatanodeDescriptor interface is not thread-safe. Consider that while {{DatanodeAdminManager#monitor}} is running, another thread sets the {{adminState}} of the corresponding DataNode to {{Decommissioned}}; the issue then reproduces.
[^HDFS-12703.005.patch] simply catches the exception, removes the DataNode from {{outOfServiceNodeBlocks}}, and pushes it back to {{pendingNodes}}, so it will be processed on the next loop.
Does it need a restart or another refreshNodes to take it out of the invalid state?
Since the check is postponed, the DataNode will be seen in a consistent state on the next loop, so there is no need to restart the DataNode or run refreshNodes again.
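To illustrate the catch-and-requeue approach, a minimal sketch (the class and field types below are simplified stand-ins, not the actual HDFS {{DatanodeAdminManager}} code):

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical simplification: one pass of a decommission monitor that
// catches a per-node failure, drops the node from the in-progress map,
// and requeues it so the next scheduled run retries it.
public class MonitorSketch {
    // Stand-ins for outOfServiceNodeBlocks and pendingNodes.
    final Map<String, Object> outOfServiceNodeBlocks = new LinkedHashMap<>();
    final Queue<String> pendingNodes = new ArrayDeque<>();

    void check() {
        // Iterate over a snapshot so removal during the loop is safe.
        for (String dn : new LinkedHashMap<>(outOfServiceNodeBlocks).keySet()) {
            try {
                processNode(dn); // may throw if another thread flipped adminState
            } catch (RuntimeException ex) {
                // Requeue instead of letting the exception escape and kill
                // the scheduled task; the node is re-examined next loop,
                // by which time its admin state should be consistent again.
                outOfServiceNodeBlocks.remove(dn);
                pendingNodes.add(dn);
            }
        }
    }

    void processNode(String dn) {
        if (dn.startsWith("bad")) {
            throw new IllegalStateException("unexpected admin state for " + dn);
        }
        outOfServiceNodeBlocks.remove(dn); // finished decommissioning
    }
}
```

The point of the sketch is only the control flow: a failure on one node no longer aborts the whole monitor pass.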

To [~xuel1], I have just assigned this JIRA to myself; please feel free to assign it back to yourself if you would like to keep working on this issue before it is resolved.

> Exceptions are fatal to decommissioning monitor
> -----------------------------------------------
>                 Key: HDFS-12703
>                 URL: https://issues.apache.org/jira/browse/HDFS-12703
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Daryn Sharp
>            Assignee: He Xiaoqiao
>            Priority: Critical
>         Attachments: HDFS-12703.001.patch, HDFS-12703.002.patch, HDFS-12703.003.patch,
HDFS-12703.004.patch, HDFS-12703.005.patch
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task.  If an exception
occurs, all decommissioning ceases until the NN is restarted.  Per javadoc for {{executor#scheduleAtFixedRate}}:
*If any execution of the task encounters an exception, subsequent executions are suppressed*.
 The monitor thread is alive but blocked waiting for an executor task that will never come.
The code currently disposes of the future so the actual exception that aborted the task is lost.
> Failover is insufficient since the task is also likely dead on the standby.  Replication
queue init after the transition to active will fix the under replication of blocks on currently
decommissioning nodes but future nodes never decommission.  The standby must be bounced prior
to failover, and hopefully the error condition does not reoccur.
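The {{scheduleAtFixedRate}} suppression behavior quoted above is easy to demonstrate with a small self-contained sketch ({{SuppressionDemo}} is a hypothetical name, not HDFS code). The unguarded task throws on its first run and never runs again; the guarded task swallows the exception inside the task body, which is essentially what the patch does for the monitor:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SuppressionDemo {
    // Counts how many times each task body actually ran.
    static final AtomicInteger unguardedRuns = new AtomicInteger();
    static final AtomicInteger guardedRuns = new AtomicInteger();

    public static void main(String[] args) {
        ScheduledExecutorService ses = Executors.newScheduledThreadPool(2);

        // Unguarded: the first thrown exception suppresses all
        // subsequent executions of this task.
        ses.scheduleAtFixedRate(() -> {
            unguardedRuns.incrementAndGet();
            throw new RuntimeException("boom");
        }, 0, 10, TimeUnit.MILLISECONDS);

        // Guarded: catching Throwable inside the task body keeps the
        // schedule alive despite the same failure every run.
        ses.scheduleAtFixedRate(() -> {
            try {
                guardedRuns.incrementAndGet();
                throw new RuntimeException("boom");
            } catch (Throwable t) {
                // log and continue; never let it escape the task body
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        try {
            Thread.sleep(200);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        ses.shutdownNow();
        System.out.println("unguarded=" + unguardedRuns.get()
            + " guarded=" + guardedRuns.get());
    }
}
```

Running this shows the unguarded counter stuck at 1 while the guarded counter keeps climbing, which matches the javadoc-described behavior that makes a single exception fatal to the decommissioning monitor.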

This message was sent by Atlassian JIRA
