hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8358) ResourceManager restart fail to recover due to TimelineServiceV1Publisher NPE
Date Thu, 24 May 2018 18:06:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489504#comment-16489504
] 

Jason Lowe commented on YARN-8358:
----------------------------------

This looks like a duplicate of YARN-8068.  Unfortunately that fix should have been committed
to branch-2.9 as well but was not. I'll cherry-pick that fix to branch-2 and branch-2.9.

> ResourceManager restart fail to recover due to TimelineServiceV1Publisher NPE
> -----------------------------------------------------------------------------
>
>                 Key: YARN-8358
>                 URL: https://issues.apache.org/jira/browse/YARN-8358
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.1
>         Environment: Ubuntu 16.04
> java version "1.8.0_91"
>            Reporter: Che Yufei
>            Priority: Major
>
> I'm upgrading from Hadoop 2.7.3 to 2.9.1. ResourceManager restart works fine for 2.7.3,
but fails on 2.9.1.
> I'm using LevelDB as the RM state store, the problem seems related to TimelineServiceV1Publisher.
If I set yarn.resourcemanager.system-metrics-publisher.enabled to false, then recovery works
fine. But if the option is set to true, RM fails to start with the following log:
>  
> {{2018-05-24 23:11:54,597 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Recovery started}}
> {{2018-05-24 23:11:54,673 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Loaded RM state version info 1.1}}
> {{2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore:
Recovered 12 RM delegation token master keys}}
> {{2018-05-24 23:11:54,688 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore:
Recovered 0 RM delegation tokens}}
> {{2018-05-24 23:11:54,990 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore:
Recovered 2099 applications and 2100 application attempts}}
> {{2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore:
Recovered 0 reservations}}
> {{2018-05-24 23:11:54,998 INFO org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
recovering RMDelegationTokenSecretManager.}}
> {{2018-05-24 23:11:55,003 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager:
Recovering 2099 applications}}
> {{2018-05-24 23:11:55,107 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAppManager:
Successfully recovered 0 out of 2099 applications}}
> {{2018-05-24 23:11:55,108 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Failed to load/recover state}}
> {{java.lang.NullPointerException}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.appCreated(TimelineServiceV1Publisher.java:90)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.sendATSCreateEvent(RMAppImpl.java:1954)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:931)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1061)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1054)}}
> {{ at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)}}
> {{ at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)}}
> {{ at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)}}
> {{ at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:878)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:339)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:533)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1394)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)}}
> {{ at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1147)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1187)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1183)}}
> {{ at java.security.AccessController.doPrivileged(Native Method)}}
> {{ at javax.security.auth.Subject.doAs(Subject.java:422)}}
> {{ at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1889)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1183)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1223)}}
> {{ at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)}}
> {{ at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1422)}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message