hadoop-yarn-issues mailing list archives

From "Rohith (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2340) NPE thrown when RM restart after queue is STOPPED
Date Mon, 15 Dec 2014 08:26:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14246442#comment-14246442
] 

Rohith commented on YARN-2340:
------------------------------

Scenario executed:
# Start a YARN cluster and submit a long-running application to the default queue. Initially,
RM1 is active.
# *Stop the default queue* on both RM1 and RM2 using -refreshQueues. A queue can be stopped
while an application is running, but it won't accept new application submissions.
# Switch the RMs so that RM2 transitions to active. Application recovery fails here because
the queue is already stopped. The logs below show the failure, but *the RMAppImpl state is updated
to FAILED while the RMAppAttempt state remains null*. The RM remains in standby.
{noformat}
2014-12-15 11:01:17,813 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Recovering app: application_1418620667348_0001 with 1 attempts and final state = null
2014-12-15 11:01:17,814 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Recovering attempt: appattempt_1418620667348_0001_000001 with final state: null
/////.....
/////....
2014-12-15 11:01:17,824 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Queue root.default is STOPPED. Cannot accept submission of application: application_1418620667348_0001
2014-12-15 11:01:17,825 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Failed to submit application application_1418620667348_0001 to queue default from user rohith
org.apache.hadoop.security.AccessControlException: Queue root.default is STOPPED. Cannot accept
submission of application: application_1418620667348_0001
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.submitApplication(LeafQueue.java:575)

2014-12-15 11:01:17,939 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Registering app attempt : appattempt_1418620667348_0001_000001
2014-12-15 11:01:17,941 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Updating application application_1418620667348_0001 with final state: FAILED
{noformat}
# After the restart, the final state is RMApp=FAILED and RMAppAttempt=null, as shown below. The RM
cannot recover the applications and fails continuously.
{noformat}
2014-12-15 11:01:41,493 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Recovering app: application_1418620667348_0001 with 1 attempts and final state = FAILED
2014-12-15 11:01:41,494 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Recovering attempt: appattempt_1418620667348_0001_000001 with final state: null
{noformat}
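
The failure mode in the scenario above can be illustrated with a small, self-contained sketch. This is not the CapacityScheduler code: ToyScheduler, submitApplication, addAttemptUnguarded and addAttemptGuarded are hypothetical names used only for illustration. It assumes that a recovered application rejected by a STOPPED queue is never registered with the scheduler, so a later attempt-added step that looks the application up without a null check would dereference null.

```java
import java.util.HashMap;
import java.util.Map;

class StoppedQueueRecoverySketch {
    enum QueueState { RUNNING, STOPPED }

    /** Toy stand-in for a scheduler; hypothetical, not the real CapacityScheduler. */
    static class ToyScheduler {
        // Applications the scheduler knows about, mapped to their queue name.
        private final Map<String, String> appToQueue = new HashMap<>();
        private QueueState queueState = QueueState.RUNNING;

        void setQueueState(QueueState state) {
            queueState = state;
        }

        /** Returns false when the queue rejects the app; nothing is registered then. */
        boolean submitApplication(String appId, String queue) {
            if (queueState == QueueState.STOPPED) {
                return false;
            }
            appToQueue.put(appId, queue);
            return true;
        }

        /** Unguarded variant: dereferences the lookup result without a null check. */
        int addAttemptUnguarded(String appId) {
            String queue = appToQueue.get(appId);
            return queue.length(); // throws NullPointerException for an unknown app
        }

        /** Guarded variant: reports an unknown app instead of dereferencing null. */
        boolean addAttemptGuarded(String appId) {
            return appToQueue.get(appId) != null;
        }
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.setQueueState(QueueState.STOPPED);

        // Recovery replays the submission, which the stopped queue rejects.
        boolean accepted = scheduler.submitApplication("application_0001", "default");
        System.out.println("recovered app accepted: " + accepted);

        // The guarded path notices that the app was never registered.
        System.out.println("guarded attempt add: "
                + scheduler.addAttemptGuarded("application_0001"));

        // The unguarded path fails on the missing app.
        try {
            scheduler.addAttemptUnguarded("application_0001");
        } catch (NullPointerException e) {
            System.out.println("unguarded attempt add threw NullPointerException");
        }
    }
}
```

Whether the real fix should be a null guard on the attempt path or allowing recovered applications into a STOPPED queue is a design question the sketch does not settle.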

> NPE thrown when RM restart after queue is STOPPED
> -------------------------------------------------
>
>                 Key: YARN-2340
>                 URL: https://issues.apache.org/jira/browse/YARN-2340
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.4.1
>         Environment: Capacityscheduler with Queue a, b
>            Reporter: Nishan Shetty
>            Assignee: Rohith
>            Priority: Critical
>
> While a job is in progress, set the queue state to STOPPED and then restart the RM.
> Observe that the standby RM fails to come up as active, throwing the NPE below:
> 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1406116264351_0014_000002 State change from NEW to SUBMITTED
> 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Error in handling event type APP_ATTEMPT_ADDED to the scheduler
> java.lang.NullPointerException
>  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568)
>  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916)
>  at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101)
>  at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602)
>  at java.lang.Thread.run(Thread.java:662)
> 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Exiting, bbye..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
