hadoop-yarn-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5677) RM should transition to standby when connection is lost for an extended period
Date Wed, 12 Oct 2016 06:26:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567771#comment-15567771 ]

Hudson commented on YARN-5677:

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10596 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/10596/])
YARN-5677. RM should transition to standby when connection is lost for (kasha: rev 6476934ae5de1be7988ab198b673d82fe0f006e3)
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMEmbeddedElector.java
* (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java

> RM should transition to standby when connection is lost for an extended period
> ------------------------------------------------------------------------------
>                 Key: YARN-5677
>                 URL: https://issues.apache.org/jira/browse/YARN-5677
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-5677.001.patch, YARN-5677.002.patch, YARN-5677.003.patch, YARN-5677.004.patch,
> In trunk, there is no maximum number of retries that I see.  It appears the connection
> will be retried forever, with the active never figuring out it's no longer active.  In my
> testing, the active-active state lasted almost 2 hours with no sign of stopping before I killed
> it.  The solution appears to be to cap the number of retries or the amount of time spent retrying.
> This issue is significant because of the asynchronous nature of job submission.  If the
> active doesn't know it's not active, it will buffer up job submissions until it finally realizes
> it has become the standby.  Then it will fail all the job submissions in bulk.  In high-volume
> workflows, that behavior can create mass job failures.
> This issue is also important because the node managers will not fail over to the new
> active until the old active realizes it's the standby.  Workloads submitted after the old
> active loses contact with ZK will therefore fail to be executed regardless of which RM the
> clients contact.
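
The remedy proposed in the description (cap the number of retries or the time spent retrying, then give up the active role) can be sketched roughly as below. This is an illustration only, not the code committed to EmbeddedElectorService: the class BoundedReconnect, its constructor parameters, and the tryConnect and transitionToStandby callbacks are all hypothetical names, and no Hadoop or ZooKeeper API is used.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch: retry a connection a bounded number of times and for a
 * bounded duration, then step down instead of retrying forever.
 */
class BoundedReconnect {
    private final int maxRetries;        // assumed cap on connection attempts
    private final long maxElapsedMillis; // assumed cap on total time spent retrying
    private final long sleepMillis;      // assumed pause between attempts

    BoundedReconnect(int maxRetries, long maxElapsedMillis, long sleepMillis) {
        this.maxRetries = maxRetries;
        this.maxElapsedMillis = maxElapsedMillis;
        this.sleepMillis = sleepMillis;
    }

    /**
     * Returns true if tryConnect succeeds within the limits. Otherwise runs
     * transitionToStandby exactly once and returns false.
     */
    boolean reconnectOrStepDown(BooleanSupplier tryConnect, Runnable transitionToStandby) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxElapsedMillis);
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (tryConnect.getAsBoolean()) {
                return true; // connection restored, stay active
            }
            if (System.nanoTime() - deadline >= 0) {
                break; // time budget exhausted
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop retrying
                break;
            }
        }
        // Either cap was hit: stop claiming to be active.
        transitionToStandby.run();
        return false;
    }
}
```

With both caps in place, the window in which two RMs both believe they are active is bounded by roughly the smaller of maxRetries attempts and maxElapsedMillis, rather than being open-ended as in the report.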

