hadoop-yarn-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
Date Thu, 14 May 2015 15:15:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543798#comment-14543798 ]

Hudson commented on YARN-3641:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #195 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/195/])
YARN-3641. NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in
stopping NM's sub-services. Contributed by Junping Du (jlowe: rev 711d77cc54a64b2c3db70bdacc6bf2245c896a4b)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java


> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3641
>                 URL: https://issues.apache.org/jira/browse/YARN-3641
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, rolling upgrade
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>             Fix For: 2.7.1
>
>         Attachments: YARN-3641.patch
>
>
> If the NM's services are not stopped properly, the NM cannot be started again with work-preserving NM restart enabled. The exception is as follows:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
> 	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
> 	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> 	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> 	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
> ************************************************************/
> {noformat}
> The relevant code in NodeManager.java is:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> serviceStop() first stops all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) through super.serviceStop(). If any of those services throws an exception while stopping, stopRecoveryStore() is skipped, which means the LevelDB store is never closed. The next time the NM starts, it fails with the exception above.
> We should put stopRecoveryStore() in a finally block.
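
The proposal above can be illustrated with a small standalone sketch. It is not the committed patch and does not use any Hadoop classes: `ServiceStopSketch`, `stopSubServices()`, `shutdownMetrics()` and the `events` list are hypothetical stand-ins, assumed here only to show that a `finally` block still runs the store cleanup when stopping a sub-service throws. Leaving the metrics shutdown after the `try`/`finally` is a choice made in this sketch to keep the original call order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for NodeManager; not a Hadoop class.
class ServiceStopSketch {
    // Records which shutdown steps ran, in order.
    final List<String> events = new ArrayList<>();
    private final AtomicBoolean isStopping = new AtomicBoolean(false);
    private final boolean subServiceFails;

    ServiceStopSketch(boolean subServiceFails) {
        this.subServiceFails = subServiceFails;
    }

    // Stand-in for super.serviceStop(): stops sub-services and may throw.
    void stopSubServices() {
        events.add("stopSubServices");
        if (subServiceFails) {
            throw new IllegalStateException("sub-service failed to stop");
        }
    }

    // Stand-in for closing the recovery state store.
    void stopRecoveryStore() {
        events.add("stopRecoveryStore");
    }

    // Stand-in for DefaultMetricsSystem.shutdown().
    void shutdownMetrics() {
        events.add("shutdownMetrics");
    }

    void serviceStop() throws Exception {
        if (isStopping.getAndSet(true)) {
            return; // a stop is already in progress or finished
        }
        try {
            stopSubServices();
        } finally {
            // Runs whether or not stopSubServices() threw.
            stopRecoveryStore();
        }
        // Reached only when the sub-services stopped without an exception.
        shutdownMetrics();
    }

    public static void main(String[] args) {
        ServiceStopSketch nm = new ServiceStopSketch(true);
        try {
            nm.serviceStop();
        } catch (Exception e) {
            System.out.println("stop failed: " + e.getMessage());
        }
        System.out.println(nm.events);
    }
}
```

With the failing configuration, the exception from the sub-service stop still propagates to the caller, but the recovery store stand-in is closed first.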



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
