hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2308) NPE happened when RM restart after CapacityScheduler queue configuration changed
Date Wed, 13 Aug 2014 02:16:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14095050#comment-14095050

Wangda Tan commented on YARN-2308:

bq. I think we should catch exception in following code and return Failed directly.
Currently the CapacityScheduler will create AppRejectedEvent when found queue not existed
while recovering or submit.
    if (queue == null) {
      String message = "Application " + applicationId + 
      " submitted by user " + user + " to unknown queue: " + queueName;
          .handle(new RMAppRejectedEvent(applicationId, message));
We cannot catch exception here, because now exception throw:
      // Add application to scheduler synchronously to guarantee scheduler
      // knows applications before AM or NM re-registers.
      app.scheduler.handle(new AppAddedSchedulerEvent(app.applicationId,
        app.submissionContext.getQueue(), app.user, true));

bq. That's what I meant. RMApp can choose to enter FAILED state directly and no need to add
attempt any more.
It will not add attempt here, because it will get rejected directly

bq. RM_WORK_PRESERVING_RECOVERY_ENABLED=true reflects the failure case in the description,
but I'm wondering why RM_WORK_PRESERVING_RECOVERY_ENABLED=false, the test is going to fail.
App will anyway be rejected, won't it?
I've tried this in my local again, it can get passed. Set RM_WORK_PRESERVING_RECOVERY_ENABLED=false
is enough to cover what we want to verify.

> NPE happened when RM restart after CapacityScheduler queue configuration changed 
> ---------------------------------------------------------------------------------
>                 Key: YARN-2308
>                 URL: https://issues.apache.org/jira/browse/YARN-2308
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.6.0
>            Reporter: Wangda Tan
>            Assignee: chang li
>            Priority: Critical
>         Attachments: jira2308.patch, jira2308.patch, jira2308.patch
> I encountered a NPE when RM restart
> {code}
> 2014-07-16 07:22:46,957 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:566)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:922)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:98)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:594)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:654)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:85)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:698)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationEventDispatcher.handle(ResourceManager.java:682)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:744)
> {code}
> And RM will be failed to restart.
> This is caused by queue configuration changed, I removed some queues and added new queues.
So when RM restarts, it tries to recover history applications, and when any of queues of these
applications removed, NPE will be raised.

This message was sent by Atlassian JIRA

View raw message