hadoop-yarn-issues mailing list archives

From "Zhijie Shen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
Date Wed, 13 Aug 2014 00:02:12 GMT

    [ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14094924#comment-14094924 ]

Zhijie Shen commented on YARN-2308:
-----------------------------------

I investigated the problem: when an app is submitted to a non-existent queue, CS
(CapacityScheduler) rejects it. A normal submission works fine, because addAppAttempt happens
only after RMApp enters ACCEPTED, by which point addApp has already executed successfully.
However, in recovery mode, addAppAttempt is triggered regardless of the result of addApp. At
that moment the app doesn't exist in CS because it has been rejected, while addAppAttempt
assumes it exists, which results in an NPE.
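
To illustrate the sequence above, here is a minimal, self-contained sketch. It is a toy
model only, not the actual CapacityScheduler code and not the attached patch: the names
ToyScheduler, addApp, addAttemptUnguarded and addAttemptGuarded are hypothetical, and the
null check is just one plausible way to avoid the failure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of the recovery sequence; NOT the real CapacityScheduler. */
class RecoverySketch {

    /** Hypothetical stand-in for a scheduler that tracks which queue each app is in. */
    static class ToyScheduler {
        private final Set<String> queues = new HashSet<>();
        private final Map<String, String> appToQueue = new HashMap<>();

        ToyScheduler(String... queueNames) {
            for (String q : queueNames) {
                queues.add(q);
            }
        }

        /** Rejects the app (returns false) when the target queue does not exist. */
        boolean addApp(String appId, String queue) {
            if (!queues.contains(queue)) {
                return false; // app is never registered with the scheduler
            }
            appToQueue.put(appId, queue);
            return true;
        }

        /** Assumes addApp succeeded; dereferences the lookup result without a check. */
        int addAttemptUnguarded(String appId) {
            String queue = appToQueue.get(appId); // null for a rejected app
            return queue.length();                // throws NullPointerException when null
        }

        /** Same lookup, but a missing app is reported instead of dereferenced. */
        boolean addAttemptGuarded(String appId) {
            String queue = appToQueue.get(appId);
            return queue != null;
        }
    }

    public static void main(String[] args) {
        // After the restart only "default" exists; the app's old queue was removed.
        ToyScheduler scheduler = new ToyScheduler("default");

        // Recovery replays addApp, which is rejected because the queue is gone.
        boolean accepted = scheduler.addApp("app_1", "removedQueue");
        System.out.println("addApp accepted: " + accepted);

        // Recovery still replays addAppAttempt, independent of the addApp result.
        try {
            scheduler.addAttemptUnguarded("app_1");
        } catch (NullPointerException e) {
            System.out.println("unguarded addAttempt threw NullPointerException");
        }
        System.out.println("guarded addAttempt added: "
            + scheduler.addAttemptGuarded("app_1"));
    }
}
```

In this sketch the guarded variant simply reports that the attempt was not added; what the
real scheduler should do in that case (log, ignore, or fail the attempt) is a separate
design decision.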

The fix makes sense to me. Some additional comments:

bq. + conf.setBoolean(YarnConfiguration.RM_WORK_PRESERVING_RECOVERY_ENABLED, true);

It should be true to imitate the failure case in the description, right? According to
AttemptRecoveredTransition, if isWorkPreservingRecoveryEnabled = true, AppAttemptAddedSchedulerEvent
will not be scheduled. However, whether or not AppAttemptAddedSchedulerEvent is scheduled, the
app should eventually get rejected, shouldn't it? What was the test failure when
isWorkPreservingRecoveryEnabled = false?

> NPE happened when RM restart after CapacityScheduler queue configuration changed 
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-2308
>                 URL: https://issues.apache.org/jira/browse/YARN-2308
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.6.0
>            Reporter: Wangda Tan
>            Assignee: chang li
>            Priority: Critical
>         Attachments: jira2308.patch, jira2308.patch, jira2308.patch
>
>
> I encountered an NPE when the RM restarted:
> {code}
> 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> And the RM fails to restart.
> This is caused by a queue configuration change: I removed some queues and added new ones.
> So when the RM restarts, it tries to recover historical applications, and if the queue of
> any of these applications has been removed, an NPE is raised.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
