hadoop-yarn-issues mailing list archives

From "Oleksandr Shevchenko (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-7913) Improve error handling when application recovery fails with exception
Date Fri, 02 Mar 2018 11:05:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383462#comment-16383462
] 

Oleksandr Shevchenko edited comment on YARN-7913 at 3/2/18 11:04 AM:
---------------------------------------------------------------------

I faced the same issue. The RM failed with an NPE during failover after the FairScheduler
configuration was changed.

The application had not finished yet, so its final state was null, and the last
app attempt had no final state either.

2018-02-28 15:50:51,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Recovering app: application_1517497680557_565955 *with 2 attempts and final state = null*
2018-02-28 15:50:54,761 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Recovering attempt: appattempt_1517497680557_565955_000001 with *final state: FAILED*
2018-02-28 15:50:54,766 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Recovering attempt: appattempt_1517497680557_565955_000002 with *final state: null*

 

In my case, an *ACL configuration in fair-scheduler.xml was changed*; as a result, the user no longer
has the right to submit this application.

In FairScheduler#addApplication() we skip this application: we do not add it to
the scheduler's application map, and we send an APP_REJECTED event to move the application to the
FAILED state.
{code:java}
if (!queue.hasAccess(QueueACL.SUBMIT_APPLICATIONS, userUgi)
    && !queue.hasAccess(QueueACL.ADMINISTER_QUEUE, userUgi)) {
  String msg = "User " + userUgi.getUserName() +
      " cannot submit applications to queue " + queue.getName();
  LOG.info(msg);
  rmContext.getDispatcher().getEventHandler()
      .handle(new RMAppRejectedEvent(applicationId, msg));
  return;
}
{code}
 

Then we try to recover the app attempts. When recovering the last app attempt, we check
the final state of the attempt and the final state of the application (see RMAppAttemptImpl#transition()).
As noted above, the application's final state is null and the last app attempt has no
final state either, so we fall back to the RM app's current state in the method "isAppInFinalState".

 
{code:java}
public static boolean isAppInFinalState(RMApp rmApp) {
  RMAppState appState = ((RMAppImpl) rmApp).getRecoveredFinalState();
  if (appState == null) {
    appState = rmApp.getState();
  }
  return appState == RMAppState.FAILED || appState == RMAppState.FINISHED
      || appState == RMAppState.KILLED;
}
{code}
 

At this point, the *current state of the application is NEW because the APP_REJECTED event has not
been processed yet*, as described by Gergo Repas. *This leads to the wrong decision to recover
the attempt*. We then try to get the application's user in FairScheduler#addApplicationAttempt and
get an NPE because the application is not found in the scheduler.
{code:java}
SchedulerApplication<FSAppAttempt> application =
    applications.get(applicationAttemptId.getApplicationId());
String user = application.getUser(); // NPE
{code}
 

{noformat}
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:740)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1327)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:117)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1100)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AttemptRecoveredTransition.transition(RMAppAttemptImpl.java:1046)
{noformat}
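
To make the ordering problem easier to see, here is a small self-contained sketch. It is a toy model, not RM code: the class RecoveryRaceSketch, its App type, and the pendingEvents queue are all hypothetical stand-ins for RMAppImpl and the async dispatcher. It only illustrates that a rejection which has been enqueued but not yet dispatched leaves the app in NEW, so a final-state check done on the recovery thread returns false.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the ordering problem; none of these types are the real YARN classes.
final class RecoveryRaceSketch {
  enum AppState { NEW, ACCEPTED, FAILED, FINISHED, KILLED }

  static final class App {
    AppState recoveredFinalState;   // null: the app had not finished before failover
    AppState state = AppState.NEW;  // current state as seen by the recovery thread
  }

  // Stand-in for the async dispatcher: events are queued, not applied immediately.
  static final Queue<Runnable> pendingEvents = new ArrayDeque<>();

  // Models the scheduler rejecting the app: the move to FAILED is only enqueued here.
  static void rejectApplication(App app) {
    pendingEvents.add(() -> app.state = AppState.FAILED);
  }

  // Same decision logic as the isAppInFinalState method quoted above.
  static boolean isAppInFinalState(App app) {
    AppState s = app.recoveredFinalState != null ? app.recoveredFinalState : app.state;
    return s == AppState.FAILED || s == AppState.FINISHED || s == AppState.KILLED;
  }

  // Applies all queued events, as the dispatcher thread would do later.
  static void drainEvents() {
    Runnable event;
    while ((event = pendingEvents.poll()) != null) {
      event.run();
    }
  }
}
```

In this model, calling rejectApplication(app) and then isAppInFinalState(app) yields false until drainEvents() has run, which corresponds to the window in which the attempt recovery proceeds and hits the missing scheduler entry.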

 

*Ideally, we should process the APP_REJECTED event before we try to recover the attempts.* But so
far I have not found an easy way to do that.

*As a workaround, we can check whether the application is null* and, if it is, skip this
attempt. This is the same approach used in CapacityScheduler and proposed in YARN-2025.

Perhaps we should open a new ticket for this.
{code:java}
SchedulerApplication<FSAppAttempt> application =
    applications.get(applicationAttemptId.getApplicationId());
// Added check: skip the attempt if the application was never added to the scheduler.
if (application == null) {
  LOG.warn("Application " + applicationAttemptId.getApplicationId() +
      " cannot be found in scheduler.");
  return;
}
String user = application.getUser();
{code}
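
For anyone who wants to try the guard in isolation, a minimal standalone sketch follows. The class NullGuardSketch and its String-keyed map are assumptions made for illustration only; the real scheduler keys its map by ApplicationId and stores SchedulerApplication objects.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the scheduler's application map; not the real FairScheduler.
final class NullGuardSketch {
  private final Map<String, String> userByAppId = new HashMap<>();

  // Registers an application together with its submitting user.
  void addApplication(String appId, String user) {
    userByAppId.put(appId, user);
  }

  // Returns true if the attempt can proceed, false if the application is unknown.
  boolean addApplicationAttempt(String appId) {
    String user = userByAppId.get(appId);
    if (user == null) {
      // The application was rejected earlier and never added: skip instead of throwing an NPE.
      System.err.println("Application " + appId + " cannot be found in scheduler.");
      return false;
    }
    // A real scheduler would continue registering the attempt for 'user' here.
    return true;
  }
}
```

The sketch returns a boolean so the skip is observable by the caller; the snippet above simply returns from a void method.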
With this change the RM no longer fails, but we get an InvalidStateTransitonException because the APP_REJECTED
event is processed too late.
{noformat}
2018-02-28 16:00:24,847 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: *Invalid event: APP_REJECTED
at ACCEPTED.*
{noformat}
If we also add a transition from the ACCEPTED state to FAILED in the RMAppImpl StateMachineFactory
{code:java}
.addTransition(RMAppState.ACCEPTED, RMAppState.FINAL_SAVING,
    RMAppEventType.APP_REJECTED,
    new FinalSavingTransition(new AppRejectedTransition(),
        RMAppState.FAILED))
{code}
 

the application fails correctly, but we hit the same problem with the attempt:

2018-03-01 16:26:23,899 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Can't handle this event at current state
org.apache.hadoop.yarn.state.InvalidStateTransitonException: *Invalid event: ATTEMPT_FAILED
at FAILED*
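
The two "Invalid event" errors can be reproduced with a toy transition table. This is only a sketch: TransitionTableSketch is a hypothetical class, it throws IllegalStateException instead of the real exception type, and it collapses FINAL_SAVING so that APP_REJECTED moves ACCEPTED directly to FAILED.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy table-driven state machine; a simplified stand-in for the RMAppImpl state machine.
final class TransitionTableSketch {
  enum State { NEW, ACCEPTED, FAILED }
  enum Event { APP_ACCEPTED, APP_REJECTED, ATTEMPT_FAILED }

  private final Map<State, Map<Event, State>> table = new HashMap<>();
  private State current = State.NEW;

  // Registers one allowed transition: in state 'from', event 'on' leads to state 'to'.
  TransitionTableSketch addTransition(State from, State to, Event on) {
    table.computeIfAbsent(from, k -> new HashMap<>()).put(on, to);
    return this;
  }

  // Applies the event, or throws if no transition is registered for the current state.
  State handle(Event event) {
    Map<Event, State> row = table.getOrDefault(current, Collections.<Event, State>emptyMap());
    State next = row.get(event);
    if (next == null) {
      throw new IllegalStateException("Invalid event: " + event + " at " + current);
    }
    current = next;
    return current;
  }
}
```

With only NEW to ACCEPTED registered, APP_REJECTED at ACCEPTED throws. Registering ACCEPTED to FAILED on APP_REJECTED fixes that case, but a later ATTEMPT_FAILED still arrives at FAILED, where no transition exists, which matches the second error in the log.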

 

{{Thanks for any comments.}}


> Improve error handling when application recovery fails with exception
> ---------------------------------------------------------------------
>
>                 Key: YARN-7913
>                 URL: https://issues.apache.org/jira/browse/YARN-7913
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.0.0
>            Reporter: Gergo Repas
>            Assignee: Gergo Repas
>            Priority: Major
>         Attachments: YARN-7913.000.poc.patch
>
>
> There are edge cases when the application recovery fails with an exception.
> Example failure scenario:
>  * setup: a queue is a leaf queue in the primary RM's config and the same queue is a
parent queue in the secondary RM's config.
>  * When failover happens with this setup, the recovery will fail for applications on
this queue, and an APP_REJECTED event will be dispatched to the async dispatcher. On the same
thread (that handles the recovery), a NullPointerException is thrown when recovery of the applicationAttempt
is attempted (https://github.com/apache/hadoop/blob/55066cc53dc22b68f9ca55a0029741d6c846be0a/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java#L494).
I don't see a good way to avoid the NPE in this scenario, because when the NPE occurs the
APP_REJECTED has not been processed yet, and we don't know that the application recovery failed.
> Currently the first exception will abort the recovery, and if there are X applications,
there will be ~X passive -> active RM transition attempts - the passive -> active RM
transition will only succeed when the last APP_REJECTED event is processed on the async dispatcher
thread.
> _The point of this ticket is to improve the error handling and reduce the number of passive
-> active RM transition attempts (solving the above described failure scenario isn't in
scope)._



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

