hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-2010) If RM fails to recover an app, it can never transition to active again
Date Wed, 22 Oct 2014 23:50:35 GMT

     [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karthik Kambatla updated YARN-2010:
-----------------------------------
    Description: 
Sometimes, the RM fails to recover an application. Possible causes include security being
turned on, token expiry, or problems connecting to HDFS. These causes fall into three classes:
(1) transient, (2) specific to one application, and (3) permanent and affecting multiple (or all)
applications. Today, when recovery fails, the RM fails to transition to Active, ends up in the
STOPPED state, and can never be transitioned to Active again.
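
The three classes of failure suggest different handling. The following is a minimal Java sketch of one possible policy, not the approach taken in the attached patches: retry transient failures, skip an application whose failure is specific to it, and let permanent failures propagate. All names here (`RecoverySketch`, `FailureKind`, `RecoveryException`, `recoverAll`) are hypothetical and do not exist in YARN.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch; none of these names come from the YARN code base. */
final class RecoverySketch {

    /** The three classes of recovery failure named in the description. */
    enum FailureKind { TRANSIENT, APP_SPECIFIC, PERMANENT }

    /** Hypothetical exception that carries the failure class. */
    static final class RecoveryException extends RuntimeException {
        final FailureKind kind;

        RecoveryException(FailureKind kind, String message) {
            super(message);
            this.kind = kind;
        }
    }

    /**
     * Recovers each application independently.
     * Transient failures are retried up to maxRetries times per application,
     * app-specific failures cause that application to be skipped, and
     * permanent failures (or exhausted retries) are rethrown to the caller.
     *
     * @return the ids of the applications that were skipped
     */
    static List<String> recoverAll(List<String> appIds,
                                   Consumer<String> recoverOne,
                                   int maxRetries) {
        List<String> skipped = new ArrayList<>();
        for (String appId : appIds) {
            int retries = 0;
            while (true) {
                try {
                    recoverOne.accept(appId);
                    break; // recovered, move on to the next application
                } catch (RecoveryException e) {
                    if (e.kind == FailureKind.TRANSIENT && retries < maxRetries) {
                        retries++;
                        continue; // try the same application again
                    }
                    if (e.kind == FailureKind.APP_SPECIFIC) {
                        skipped.add(appId);
                        break; // give up on this application only
                    }
                    throw e; // permanent, or transient retries exhausted
                }
            }
        }
        return skipped;
    }
}
```

Under this policy, a single unrecoverable application would be reported in the returned list instead of preventing the transition to Active.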



  was:
If the RM fails to recover an app attempt, it won't come up. We should make it more resilient.

Specifically, the underlying error is that the app was submitted before Kerberos security
was turned on. It makes sense for the app to fail in this case, but YARN should still start.

{noformat}
2014-04-11 11:56:37,216 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:118)
    at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:804)
    at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:415)
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:274)
    at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:116)
    ... 4 more
Caused by: org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:204)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:811)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:842)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:265)
    ... 5 more
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: java.lang.IllegalArgumentException: Missing argument
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:372)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:273)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:406)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1000)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:462)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    ... 8 more
Caused by: java.lang.IllegalArgumentException: Missing argument
    at javax.crypto.spec.SecretKeySpec.<init>(SecretKeySpec.java:93)
    at org.apache.hadoop.security.token.SecretManager.createSecretKey(SecretManager.java:188)
    at org.apache.hadoop.yarn.server.resourcemanager.security.ClientToAMTokenSecretManagerInRM.registerMasterKey(ClientToAMTokenSecretManagerInRM.java:49)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recoverAppAttemptCredentials(RMAppAttemptImpl.java:711)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:689)
    at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:663)
    at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:369)
    ... 13 more
{noformat}
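
The innermost cause in the trace is `SecretKeySpec` rejecting its constructor arguments. A minimal sketch of that behavior follows, assuming the recovered attempt carried no client-to-AM master key (an assumption; the trace does not show the key value) and using `HmacSHA1` purely as an illustrative algorithm name. `MissingKeySketch` and `rejects` are hypothetical names.

```java
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch of the innermost failure in the trace. */
final class MissingKeySketch {

    /** Returns true if SecretKeySpec refuses to build a key from the given bytes. */
    static boolean rejects(byte[] keyBytes) {
        try {
            // "HmacSHA1" is an illustrative algorithm name, not taken from the trace.
            new SecretKeySpec(keyBytes, "HmacSHA1");
            return false;
        } catch (IllegalArgumentException e) {
            // A null or empty key is rejected with IllegalArgumentException.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("null key rejected: " + rejects(null));
        System.out.println("3-byte key rejected: " + rejects(new byte[] {1, 2, 3}));
    }
}
```

Because the exception is unchecked, it propagates through every recovery frame shown above unless a caller along the way catches it.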


> If RM fails to recover an app, it can never transition to active again
> ----------------------------------------------------------------------
>
>                 Key: YARN-2010
>                 URL: https://issues.apache.org/jira/browse/YARN-2010
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.3.0
>            Reporter: bc Wong
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: YARN-2010.1.patch, YARN-2010.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch
>
>
> Sometimes, the RM fails to recover an application. Possible causes include security being
> turned on, token expiry, or problems connecting to HDFS. These causes fall into three classes:
> (1) transient, (2) specific to one application, and (3) permanent and affecting multiple (or all)
> applications. Today, when recovery fails, the RM fails to transition to Active, ends up in the
> STOPPED state, and can never be transitioned to Active again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
