Date: Tue, 26 Sep 2017 23:13:00 +0000 (UTC)
From: "Jonathan Hung (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-7252) Removing queue then failing over results in exception

     [ https://issues.apache.org/jira/browse/YARN-7252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Hung updated YARN-7252:
--------------------------------
    Attachment: YARN-7252-YARN-5734.002.patch

> Removing queue then failing over results in exception
> -----------------------------------------------------
>
>                 Key: YARN-7252
>                 URL: https://issues.apache.org/jira/browse/YARN-7252
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Jonathan Hung
>            Assignee: Jonathan Hung
>            Priority: Critical
>         Attachments: YARN-7252-YARN-5734.001.patch, YARN-7252-YARN-5734.002.patch
>
>
> Scenario: two ResourceManagers, rm1 and rm2, with a starting configuration containing root.default and root.a; rm1 is active. First, put root.a into STOPPED state, then remove it. Then transition rm1 to standby and rm2 to active.
> Here's the exception:
> {noformat}
> Operation failed: Error on refreshAll during transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:315)
>     at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
>     at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1962)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: RefreshAll operation failed
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:747)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:307)
>     ... 10 more
> Caused by: java.io.IOException: Failed to re-init queues : root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:436)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:405)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:736)
>     ... 11 more
> Caused by: java.io.IOException: root.a is deleted from the new capacity scheduler configuration, but the queue is not yet in stopped state. Current State : RUNNING
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateQueueHierarchy(CapacitySchedulerQueueManager.java:312)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.reinitializeQueues(CapacitySchedulerQueueManager.java:174)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:648)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:432)
>     ... 13 more
> {noformat}
> It seems rm2 does not know that root.a was STOPPED, so when it finds that root.a has been deleted from the configuration, it throws an exception.
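The suspected failure mode can be modeled in a few lines of Java. This is a hypothetical sketch, not Hadoop code: the class and method names are made up for illustration, validateQueueHierarchy here only mimics the rule named in the innermost exception (a queue may disappear from the new configuration only if it is already STOPPED), and it assumes, as the report suspects, that the standby rm2 still sees root.a as RUNNING when it re-reads the configuration during transitionToActive.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class QueueRemovalFailoverSketch {

    enum QueueState { RUNNING, STOPPED }

    // Hypothetical in-memory view of one ResourceManager's queues: path -> state.
    static Map<String, QueueState> startingQueues() {
        Map<String, QueueState> queues = new HashMap<>();
        queues.put("root.default", QueueState.RUNNING);
        queues.put("root.a", QueueState.RUNNING);
        return queues;
    }

    // Simplified stand-in for the check reported by validateQueueHierarchy:
    // a queue missing from the new configuration must already be STOPPED
    // in the scheduler's current view, otherwise re-initialization fails.
    static void validateQueueHierarchy(Map<String, QueueState> current,
                                       Map<String, QueueState> newConf) throws IOException {
        for (Map.Entry<String, QueueState> e : current.entrySet()) {
            if (!newConf.containsKey(e.getKey()) && e.getValue() != QueueState.STOPPED) {
                throw new IOException(e.getKey()
                    + " is deleted from the new capacity scheduler configuration,"
                    + " but the queue is not yet in stopped state. Current State : "
                    + e.getValue());
            }
        }
    }

    // Replays the reported scenario; returns "OK" or the error rm2 would hit.
    static String simulateFailover() {
        Map<String, QueueState> rm1 = startingQueues(); // active RM
        Map<String, QueueState> rm2 = startingQueues(); // standby RM, assumed never updated

        // On rm1: put root.a into STOPPED state, then remove it from the configuration.
        rm1.put("root.a", QueueState.STOPPED);
        Map<String, QueueState> newConf = new HashMap<>(rm1);
        newConf.remove("root.a");
        try {
            validateQueueHierarchy(rm1, newConf); // passes: rm1 sees root.a as STOPPED
        } catch (IOException unexpected) {
            throw new IllegalStateException(unexpected);
        }

        // Failover: rm2 re-initializes its queues against the new configuration,
        // but in its own view root.a is still RUNNING, so validation fails.
        try {
            validateQueueHierarchy(rm2, newConf);
            return "OK";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulateFailover());
    }
}
```

Under these assumptions the removal succeeds on rm1 but the same check fails on rm2 with the message from the innermost "Caused by" line, which is consistent with the state change having been applied only on the RM that was active at the time.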