From yarn-issues-return-28470-apmail-hadoop-yarn-issues-archive=hadoop.apache.org@hadoop.apache.org Wed May 28 17:37:03 2014
Date: Wed, 28 May 2014 17:37:03 +0000 (UTC)
From: "Jian He (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-2010) RM can't transition to active if it can't recover an app attempt

    [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14011355#comment-14011355 ]

Jian He commented on YARN-2010:
-------------------------------

bq. The stack trace corresponds to non-work-preserving restart. I am not sure I understand the concern.

What I meant is that, in this scenario, it shouldn't matter whether the old attempt has the master key or not, since the old attempt will be killed by the NM on RM restart anyway.
The newly started attempt will have the proper master key generated. If we just check whether the key is null and move on, the next attempt should be able to succeed. So we don't need to explicitly fail the app?

> RM can't transition to active if it can't recover an app attempt
> ----------------------------------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Rohith
>            Priority: Critical
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch
>
>
> If the RM fails to recover an app attempt, it won't come up. We should make it more resilient.
> Specifically, the underlying error is that the app was submitted before Kerberos security got turned on. Makes sense for the app to fail in this case. But YARN should still start.
> {noformat}
> 2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
>     ... 4 more
> Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
>     ... 5 more
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
>     at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>     ... 8 more
> Caused by: java.lang.IllegalArgumentException: Missing argument
>     at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93)
>     at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
>     at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
>     at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
>     at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
>     ... 13 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
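
For illustration only, the "check whether the key is null and move on" idea from the comment can be sketched in plain Java. This is not the attached patch and not the Hadoop code: the class `MasterKeyRecoverySketch`, its methods, the attempt IDs, and the choice of `HmacSHA1` are all assumptions made for the example. The only real API used is `javax.crypto.spec.SecretKeySpec`, whose constructor rejects a null key with `IllegalArgumentException`, as the innermost frame of the stack trace shows.

```java
import java.util.HashMap;
import java.util.Map;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical stand-in for an RM-side secret manager; NOT the Hadoop class.
final class MasterKeyRecoverySketch {
    // Assumed algorithm name for the sketch.
    private static final String ALGORITHM = "HmacSHA1";

    private final Map<String, SecretKey> masterKeys = new HashMap<>();

    /**
     * Registers the recovered master key only when key bytes are present.
     * Returns false when the key is missing, so recovery can move on
     * instead of letting SecretKeySpec throw IllegalArgumentException.
     */
    boolean registerMasterKeyIfPresent(String attemptId, byte[] keyBytes) {
        if (keyBytes == null || keyBytes.length == 0) {
            // Nothing was persisted for this attempt (for example, the app
            // was submitted before security was enabled). Skip registration.
            return false;
        }
        masterKeys.put(attemptId, new SecretKeySpec(keyBytes, ALGORITHM));
        return true;
    }

    boolean hasMasterKey(String attemptId) {
        return masterKeys.containsKey(attemptId);
    }

    public static void main(String[] args) {
        MasterKeyRecoverySketch manager = new MasterKeyRecoverySketch();

        // Old attempt recovered from the state store without a key.
        boolean registered = manager.registerMasterKeyIfPresent("attempt_old", null);
        System.out.println("old attempt key registered: " + registered);

        // A newly started attempt carries freshly generated key bytes.
        registered = manager.registerMasterKeyIfPresent("attempt_new", new byte[] {1, 2, 3, 4});
        System.out.println("new attempt key registered: " + registered);
    }
}
```

With a guard of this kind, a missing key for the old attempt no longer aborts recovery, which is the behaviour the comment argues for; whether the app should additionally be failed explicitly is the open question in the thread.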