hadoop-yarn-issues mailing list archives

From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4401) A failed app recovery should not prevent the RM from starting
Date Tue, 01 Dec 2015 19:17:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034399#comment-15034399 ]

Daniel Templeton commented on YARN-4401:

There are lots of reasons a recovery could fail.  For example, if a job is stored with a resource
allocation that is higher than the configured maximum at the time of recovery, the recovery
will throw an exception which will prevent the RM from starting.

In a single RM configuration, it makes some sense to allow the RM restart to be interrupted
by a recovery failure, but in an HA scenario, the standby is becoming active to prevent an outage.
Causing an outage over a bad application undermines the point of HA.  It becomes a question
of trading an application failure for a service outage.  I think most sites would choose the

There's already yarn.fail-fast and yarn.resourcemanager.fail-fast that control this behavior
for some of the recovery failure scenarios, such as bad queue assignments.  I would propose
we extend the meaning of those properties to cover the full range of what could go wrong during

> A failed app recovery should not prevent the RM from starting
> -------------------------------------------------------------
>                 Key: YARN-4401
>                 URL: https://issues.apache.org/jira/browse/YARN-4401
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
> There are many different reasons why an app recovery could fail with an exception, causing
> the RM start to be aborted.  If that happens the RM will fail to start.  Presumably, the reason
> the RM is trying to do a recovery is that it's the standby trying to fill in for the active.
> Failing to come up defeats the purpose of the HA configuration.  Instead of preventing the
> RM from starting, a failed app recovery should log an error and skip the application.
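
To make the proposed behavior concrete, here is a minimal sketch of a recovery loop that
logs and skips an application whose recovery throws, unless a fail-fast flag is set.  This
is not the actual ResourceManager code: the class RecoverySketch, the method recoverAll,
the recoverOne callback, and the failFast parameter are all hypothetical names used for
illustration.  The flag only stands in for whatever value the RM would derive from
yarn.resourcemanager.fail-fast.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names exist in Hadoop YARN.
public class RecoverySketch {

    /**
     * Attempts to recover every stored application ID in order.
     *
     * @param appIds     IDs of the applications found in the state store
     * @param recoverOne callback that recovers one application and throws on failure
     * @param failFast   if true, the first failure aborts the whole recovery
     * @return the IDs of the applications that were recovered successfully
     */
    public static List<String> recoverAll(List<String> appIds,
                                          Consumer<String> recoverOne,
                                          boolean failFast) {
        List<String> recovered = new ArrayList<>();
        for (String appId : appIds) {
            try {
                recoverOne.accept(appId);
                recovered.add(appId);
            } catch (RuntimeException e) {
                if (failFast) {
                    // Current behavior: one bad application stops the RM from starting.
                    throw e;
                }
                // Proposed behavior: log the error and move on to the next application.
                System.err.println("Skipping recovery of " + appId + ": " + e.getMessage());
            }
        }
        return recovered;
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("app_1", "app_2", "app_3");
        // Simulate one application whose stored allocation is no longer valid.
        Consumer<String> recoverOne = appId -> {
            if (appId.equals("app_2")) {
                throw new IllegalStateException("allocation exceeds configured maximum");
            }
        };
        System.out.println("Recovered: " + recoverAll(stored, recoverOne, false));
    }
}
```

With failFast set to false the loop finishes and only the bad application is lost; with
failFast set to true the exception propagates to the caller, which matches the case where
a site prefers the RM start to be aborted.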

This message was sent by Atlassian JIRA
