hadoop-yarn-issues mailing list archives

From "Jonathan Hung (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7252) Removing queue then failing over results in exception
Date Tue, 26 Sep 2017 23:15:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181728#comment-16181728 ]

Jonathan Hung commented on YARN-7252:
-------------------------------------

[~leftnoteasy] thanks for taking a look. Attached the 002 patch.
bq. Add isConfigurationMutable to CSContext can avoid type conversion.
Done.
bq. Could you add a test to make sure that the scheduler still validates the queue hierarchy when transitioning from init->active? (Since we don't validate the queue hierarchy on failover, we should check it when the RM starts.)
I don't think validateQueueHierarchy is called on RM startup; it is only called on scheduler refresh (i.e. it is called in CapacityScheduler#reinitializeQueues, but not in CapacityScheduler#initializeQueues).
Its purpose is to validate differences between the old and new queue hierarchies, and on startup there is no old queue hierarchy. Let me know if I'm missing something.
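
To make the distinction concrete, here is a minimal sketch of the two paths. It is a simplified model, not the actual CapacityScheduler or CapacitySchedulerQueueManager code: the class QueueHierarchySketch, the QueueState enum, and the map-based queue representation are all hypothetical, and only the method names and the exception text are borrowed from this issue.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the startup and refresh paths.
class QueueHierarchySketch {
  enum QueueState { RUNNING, STOPPED }

  // Compares the old hierarchy with the new one: a queue that is missing
  // from the new configuration must already be STOPPED in the old one.
  static void validateQueueHierarchy(Map<String, QueueState> oldQueues,
      Set<String> newQueues) throws IOException {
    for (Map.Entry<String, QueueState> e : oldQueues.entrySet()) {
      if (!newQueues.contains(e.getKey())
          && e.getValue() != QueueState.STOPPED) {
        throw new IOException(e.getKey() + " is deleted from the new capacity"
            + " scheduler configuration, but the queue is not yet in stopped"
            + " state. Current State : " + e.getValue());
      }
    }
  }

  // Startup path: there is no old hierarchy, so nothing is validated.
  static Map<String, QueueState> initializeQueues(Set<String> configured) {
    Map<String, QueueState> queues = new LinkedHashMap<>();
    for (String q : configured) {
      queues.put(q, QueueState.RUNNING);
    }
    return queues;
  }

  // Refresh path: validate old vs. new before swapping in the new hierarchy.
  static Map<String, QueueState> reinitializeQueues(
      Map<String, QueueState> oldQueues, Set<String> configured)
      throws IOException {
    validateQueueHierarchy(oldQueues, configured);
    Map<String, QueueState> queues = new LinkedHashMap<>();
    for (String q : configured) {
      // Surviving queues keep their state; new queues start RUNNING.
      queues.put(q, oldQueues.getOrDefault(q, QueueState.RUNNING));
    }
    return queues;
  }

  public static void main(String[] args) throws IOException {
    Set<String> withA = new HashSet<>(Arrays.asList("root.default", "root.a"));
    Set<String> withoutA = new HashSet<>(Arrays.asList("root.default"));

    Map<String, QueueState> queues = initializeQueues(withA);
    try {
      reinitializeQueues(queues, withoutA); // root.a is still RUNNING here
    } catch (IOException e) {
      System.out.println("Refresh rejected: " + e.getMessage());
    }

    queues.put("root.a", QueueState.STOPPED); // stop the queue first
    queues = reinitializeQueues(queues, withoutA);
    System.out.println("Refresh accepted: " + queues.keySet());
  }
}
```

In this model, removing a queue only passes validation on the refresh path if the in-memory state being compared against already shows the queue as STOPPED; the startup path never performs the comparison.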

> Removing queue then failing over results in exception
> -----------------------------------------------------
>
>                 Key: YARN-7252
>                 URL: https://issues.apache.org/jira/browse/YARN-7252
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jonathan Hung
>            Assignee: Jonathan Hung
>            Priority: Critical
>         Attachments: YARN-7252-YARN-5734.001.patch, YARN-7252-YARN-5734.002.patch
>
>
> Scenario: rm1 and rm2, starting configuration with root.default, root.a. rm1 is active. First, put root.a into STOPPED state, then remove it. Then put rm1 in standby and rm2 in active. Here's the exception: {noformat}Operation failed: Error on refreshAll during transition to Active
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
> 	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
> 	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
> 	... 10 more
> Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
> 	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
> 	... 11 more
> Caused by: java.io.IOException: root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
> 	... 13 more{noformat}
> It seems rm2 does not know that root.a was STOPPED, so when it cannot find root.a in the new configuration and sees that it was deleted, it throws an exception.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
