Date: Tue, 21 Jul 2015 22:42:05 +0000 (UTC)
From: "Allen Wittenauer (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Assigned] (YARN-3641) NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.

     [ https://issues.apache.org/jira/browse/YARN-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer reassigned YARN-3641:
--------------------------------------

    Assignee: Allen Wittenauer  (was: Junping Du)

> NodeManager: stopRecoveryStore() shouldn't be skipped when exceptions happen in stopping NM's sub-services.
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3641
>                 URL: https://issues.apache.org/jira/browse/YARN-3641
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager, rolling upgrade
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Allen Wittenauer
>            Priority: Critical
>             Fix For: 2.7.1
>
>         Attachments: YARN-3641.patch
>
>
> If the NM's services are not stopped properly, the NM cannot be started again when work-preserving NM restart is enabled. The exception is as follows:
> {noformat}
> org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
> 	at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:175)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:217)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:507)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:555)
> Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: IO error: lock /var/log/hadoop-yarn/nodemanager/recovery-state/yarn-nm-state/LOCK: Resource temporarily unavailable
> 	at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
> 	at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
> 	at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:930)
> 	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:204)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	... 5 more
> 2015-05-12 00:34:45,262 INFO  nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
> /************************************************************
> SHUTDOWN_MSG: Shutting down NodeManager at c6403.ambari.apache.org/192.168.64.103
> ************************************************************/
> {noformat}
> The related code in NodeManager.java is as follows:
> {code}
>   @Override
>   protected void serviceStop() throws Exception {
>     if (isStopping.getAndSet(true)) {
>       return;
>     }
>     super.serviceStop();
>     stopRecoveryStore();
>     DefaultMetricsSystem.shutdown();
>   }
> {code}
> We can see that all of the NM's registered services (NodeStatusUpdater, LogAggregationService, ResourceLocalizationService, etc.) are stopped first. If any of these services throws an exception while stopping, stopRecoveryStore() is skipped, which means the levelDB store is not closed. The next time the NM starts, it fails with the exception above.
> We should put stopRecoveryStore(); in a finally block.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
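
For illustration, the finally-block suggestion in the description could look roughly like the sketch below. It is a self-contained stand-in and not the attached YARN-3641.patch: the class NodeManagerStopSketch, the stopSubServices() helper, and the subServiceFails flag are hypothetical, with stopSubServices() playing the role of super.serviceStop(). DefaultMetricsSystem.shutdown() is left out so the example has no Hadoop dependencies.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for NodeManager; not the real Hadoop class.
class NodeManagerStopSketch {
    private final AtomicBoolean isStopping = new AtomicBoolean(false);
    private final boolean subServiceFails;

    // Set once the (simulated) recovery store has been closed.
    boolean recoveryStoreClosed = false;

    NodeManagerStopSketch(boolean subServiceFails) {
        this.subServiceFails = subServiceFails;
    }

    // Assumed placeholder for super.serviceStop(), which stops the
    // registered sub-services and may throw.
    private void stopSubServices() throws Exception {
        if (subServiceFails) {
            throw new IllegalStateException("sub-service failed to stop");
        }
    }

    // Assumed placeholder for stopRecoveryStore(), which closes the
    // state store and thereby releases its LOCK file.
    private void stopRecoveryStore() {
        recoveryStoreClosed = true;
    }

    void serviceStop() throws Exception {
        if (isStopping.getAndSet(true)) {
            return;
        }
        try {
            stopSubServices();
        } finally {
            // Runs whether or not a sub-service threw while stopping.
            stopRecoveryStore();
        }
    }
}
```

With this shape, an exception from a sub-service still propagates to the caller, but the recovery store is closed before it does.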