hadoop-hdfs-issues mailing list archives

From Íñigo Goiri (JIRA) <j...@apache.org>
Subject [jira] [Commented] (HDFS-12703) Exceptions are fatal to decommissioning monitor
Date Mon, 08 Jul 2019 21:53:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880732#comment-16880732 ]

Íñigo Goiri commented on HDFS-12703:

Tested it without the fix and it throws:
{{DatanodeAdminManager#monitor does not swallow exceptions.}}
So we are good there.

One minor thing: the {{anyThreadMatching()}} check should be wrapped in an {{assertTrue()}}.

> Exceptions are fatal to decommissioning monitor
> -----------------------------------------------
>                 Key: HDFS-12703
>                 URL: https://issues.apache.org/jira/browse/HDFS-12703
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Daryn Sharp
>            Assignee: He Xiaoqiao
>            Priority: Critical
>         Attachments: HDFS-12703.001.patch, HDFS-12703.002.patch, HDFS-12703.003.patch,
>                      HDFS-12703.004.patch, HDFS-12703.005.patch, HDFS-12703.006.patch,
>                      HDFS-12703.007.patch, HDFS-12703.008.patch
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task.  If an exception
> occurs, all decommissioning ceases until the NN is restarted.  Per javadoc for {{executor#scheduleAtFixedRate}}:
> *If any execution of the task encounters an exception, subsequent executions are suppressed*.
>  The monitor thread is alive but blocked waiting for an executor task that will never come.
>  The code currently disposes of the future so the actual exception that aborted the task is
> Failover is insufficient since the task is also likely dead on the standby.  Replication
> queue init after the transition to active will fix the under replication of blocks on currently
> decommissioning nodes but future nodes never decommission.  The standby must be bounced prior
> to failover – and hopefully the error condition does not reoccur.
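
For illustration only, here is a minimal standalone sketch of the {{scheduleAtFixedRate}} behavior quoted in the description. It is not the HDFS code and not the attached patch: the class name, method names, exception message and 10 ms period are all hypothetical. The first method lets an exception escape the task, so later runs are suppressed; the second catches inside the task body, so the schedule keeps going.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; it only uses java.util.concurrent, no Hadoop code.
public class FixedRateSuppressionDemo {

  /** The task throws on its second run; returns how many times it ran in total. */
  static int runsWhenTaskThrows() throws InterruptedException {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    ScheduledFuture<?> future = executor.scheduleAtFixedRate(() -> {
      if (runs.incrementAndGet() == 2) {
        throw new IllegalStateException("simulated monitor failure");
      }
    }, 0, 10, TimeUnit.MILLISECONDS);
    try {
      // A periodic task never completes normally, so get() returns only
      // once the task has failed (or been cancelled).
      future.get();
    } catch (ExecutionException e) {
      // e.getCause() is the exception that aborted the task. If the future
      // is thrown away, this is the information that is lost.
    }
    Thread.sleep(100); // leave time for further runs; none happen
    executor.shutdownNow();
    return runs.get();
  }

  /** Same failure, but caught inside the task body; the schedule survives. */
  static int runsWhenTaskCatches() throws InterruptedException {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    AtomicInteger runs = new AtomicInteger();
    CountDownLatch fiveRuns = new CountDownLatch(5);
    executor.scheduleAtFixedRate(() -> {
      try {
        if (runs.incrementAndGet() == 2) {
          throw new IllegalStateException("simulated monitor failure");
        }
      } catch (RuntimeException e) {
        // A real monitor would log the exception here and carry on.
      } finally {
        fiveRuns.countDown();
      }
    }, 0, 10, TimeUnit.MILLISECONDS);
    fiveRuns.await(5, TimeUnit.SECONDS); // wait until five runs have finished
    executor.shutdownNow();
    return runs.get();
  }

  public static void main(String[] args) throws InterruptedException {
    System.out.println("uncaught: " + runsWhenTaskThrows() + " runs");
    System.out.println("caught:   " + runsWhenTaskCatches() + " runs");
  }
}
```

In the first case the task runs exactly twice and then never again, even though the executor thread stays alive, which matches the "alive but blocked" state described above. Whether the actual fix catches inside the task or handles the failure some other way is not stated in this message.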

This message was sent by Atlassian JIRA

