Return-Path:
X-Original-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1BF9617E0D for ; Wed, 3 Jun 2015 13:22:39 +0000 (UTC)
Received: (qmail 68452 invoked by uid 500); 3 Jun 2015 13:22:38 -0000
Delivered-To: apmail-hadoop-yarn-issues-archive@hadoop.apache.org
Received: (qmail 68370 invoked by uid 500); 3 Jun 2015 13:22:38 -0000
Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: yarn-issues@hadoop.apache.org
Delivered-To: mailing list yarn-issues@hadoop.apache.org
Received: (qmail 68354 invoked by uid 99); 3 Jun 2015 13:22:38 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 03 Jun 2015 13:22:38 +0000
Date: Wed, 3 Jun 2015 13:22:38 +0000 (UTC)
From: "Sunil G (JIRA)"
To: yarn-issues@hadoop.apache.org
Message-ID:
In-Reply-To:
References:
Subject: [jira] [Commented] (YARN-3754) Race condition when the NodeManager is shutting down and container is launched
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

    [ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570790#comment-14570790 ]

Sunil G commented on YARN-3754:
-------------------------------

I have got the logs from [~bibinchundatt] offline.

{noformat}
2015-05-30 01:11:16,179 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e313_1432908361253_4506_01_000001 and exit code: 0
java.io.IOException: java.lang.InterruptedException
...
...
2015-05-30 01:11:16,179 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update diagnostics in state store for container_e313_1432908361253_4506_01_000001
java.io.IOException: org.iq80.leveldb.DBException: Closed
        at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostic
{noformat}

When the NM is shutting down, ContainerLaunch is also interrupted. While handling this InterruptedException, the NM tries to update the container diagnostics. By then the main thread has already stopped the state store, which causes the {{DBException: Closed}}.

This scenario is already handled in YARN-3641 by [~djp]. [~bibinchundatt], could you please apply that patch and check this, so that we can close this ticket as a duplicate?

Attaching NM logs too.

> Race condition when the NodeManager is shutting down and container is launched
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3754
>                 URL: https://issues.apache.org/jira/browse/YARN-3754
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>         Environment: Suse 11 Sp3
>            Reporter: Bibin A Chundatt
>            Assignee: Sunil G
>            Priority: Critical
>
> Container is launched and returned to ContainerImpl.
> NodeManager closed the DB connection, which results in {{org.iq80.leveldb.DBException: Closed}}.
> *Attaching the exception trace*
> {code}
> 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_000002
> java.io.IOException: org.iq80.leveldb.DBException: Closed
>         at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.iq80.leveldb.DBException: Closed
>         at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
>         at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
>         at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
>         ... 15 more
> {code}
> We can add a check for whether the DB is closed when we move the container from the ACQUIRED state.
> As per the discussion in YARN-3585, I have added the same.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
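
For illustration only, the "check whether DB is closed" idea from the issue description could be sketched roughly as below. This is not the YARN-3641 patch and not NodeManager code: SketchStateStore, isClosed() and updateDiagnostics() are hypothetical names, and an in-memory map stands in for leveldb. The sketch only shows a diagnostics update that is skipped once the store has been closed, and that still tolerates the store being closed between the check and the write.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class StateStoreShutdownSketch {

    /** Hypothetical stand-in for the NM state store; an in-memory map replaces leveldb. */
    static final class SketchStateStore {
        private final Map<String, String> diagnostics = new ConcurrentHashMap<>();
        private final AtomicBoolean closed = new AtomicBoolean(false);

        /** Mimics the failure in the trace: writing after close throws an IOException. */
        void storeContainerDiagnostics(String containerId, String text) throws IOException {
            if (closed.get()) {
                throw new IOException("Closed");
            }
            diagnostics.put(containerId, text);
        }

        /** Called by the (simulated) shutdown path. */
        void close() {
            closed.set(true);
        }

        boolean isClosed() {
            return closed.get();
        }

        String getDiagnostics(String containerId) {
            return diagnostics.get(containerId);
        }
    }

    /**
     * Guarded update: returns true if the diagnostics were persisted, false if the
     * store was already closed. The isClosed() check alone is not sufficient, because
     * shutdown can close the store between the check and the write, so the
     * IOException is handled as well.
     */
    static boolean updateDiagnostics(SketchStateStore store, String containerId, String text) {
        if (store.isClosed()) {
            return false; // shutdown in progress, skip the write
        }
        try {
            store.storeContainerDiagnostics(containerId, text);
            return true;
        } catch (IOException e) {
            return false; // store was closed after the check
        }
    }

    public static void main(String[] args) {
        SketchStateStore store = new SketchStateStore();
        System.out.println("before close: "
            + updateDiagnostics(store, "container_1", "launch interrupted"));
        store.close(); // simulate the main thread stopping the state store
        System.out.println("after close: "
            + updateDiagnostics(store, "container_1", "late update"));
    }
}
```

Whether a skipped update should be logged or silently dropped during shutdown is a separate design choice that this sketch leaves open.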