hadoop-yarn-issues mailing list archives

From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-5677) RM can be in active-active state for an extended period
Date Mon, 26 Sep 2016 21:48:20 GMT
Daniel Templeton created YARN-5677:

             Summary: RM can be in active-active state for an extended period
                 Key: YARN-5677
                 URL: https://issues.apache.org/jira/browse/YARN-5677
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.0.0-alpha1, 2.7.3
            Reporter: Daniel Templeton
            Assignee: Daniel Templeton
            Priority: Critical

Both branch-2.8/trunk and branch-2.7 have issues when the active RM loses contact with the
ZK node(s).

In branch-2.7, the RM will retry the connection 1000 times by default.  Each attempt to contact
an unreachable node is slow, which means the active RM can take over an hour to realize it is
no longer active.  I clocked it at about an hour and a half in my tests.  The solution appears
to be to add some time awareness to the retry loop.

In branch-2.8/trunk, I don't see any maximum number of retries.  It appears the connection
will be retried forever, so the active RM never figures out that it's no longer active.  I have
a test running, and I'll update this description with empirical findings when I'm done.  The
solution appears to be to cap the number of retries or the amount of time spent retrying.
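
To make the two proposed fixes concrete, here is a minimal sketch of a retry loop that is bounded
by both an attempt count and a wall-clock budget.  It is not the RM's actual ZK retry code:
BoundedRetry, RetriesExhaustedException, and the maxAttempts/maxElapsed/pause parameters are all
hypothetical names chosen for illustration.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration only; not part of Hadoop or ZooKeeper. */
final class BoundedRetry {

    /** Thrown when the attempt cap or the time budget runs out, or on interrupt. */
    static final class RetriesExhaustedException extends RuntimeException {
        RetriesExhaustedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private BoundedRetry() {}

    /**
     * Runs op until it succeeds, maxAttempts is reached, or maxElapsed has passed,
     * whichever comes first. The caller decides what giving up means (for an
     * active RM, presumably stepping down).
     */
    static <T> T call(Callable<T> op, int maxAttempts, Duration maxElapsed, Duration pause) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        final long deadline = System.nanoTime() + maxElapsed.toNanos();
        Exception last = null;
        int attempts = 0;
        while (attempts < maxAttempts) {
            attempts++;
            try {
                return op.call();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RetriesExhaustedException("interrupted during attempt " + attempts, e);
            } catch (Exception e) {
                last = e; // remember the failure and fall through to the budget check
            }
            // Time awareness: a slow failing attempt eats into the budget, so the
            // clock is checked after every attempt, not only the attempt counter.
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                break;
            }
            if (attempts < maxAttempts) {
                try {
                    // Never sleep past the deadline.
                    TimeUnit.NANOSECONDS.sleep(Math.min(pause.toNanos(), remaining));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new RetriesExhaustedException("interrupted while pausing", e);
                }
            }
        }
        throw new RetriesExhaustedException("gave up after " + attempts + " attempt(s)", last);
    }
}
```

With a loop shaped like this, a very large attempt count no longer implies a very long outage,
because the time budget ends the loop even when each failed attempt is slow.  One caveat: the
sketch cannot cut short an attempt that is already in progress, so a single attempt still needs
its own connection timeout.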

This issue is significant because of the asynchronous nature of job submission.  If the active
RM doesn't know it's no longer active, it will buffer up job submissions until it finally
realizes it has become the standby.  Then it will fail all the job submissions in bulk.  In
high-volume workflows, that behavior can cause mass job failures.

