Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Wed, 3 Jun 2015 13:22:38 +0000 (UTC)
From: "Sunil G (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12834468.1433218928000.51.1433337758630@Atlassian.JIRA>
In-Reply-To: <JIRA.12834468.1433218928000@Atlassian.JIRA>
References: <JIRA.12834468.1433218928000@Atlassian.JIRA>
 <JIRA.12834468.1433218928807@arcas>
Subject: [jira] [Commented] (YARN-3754) Race condition when the NodeManager
 is shutting down and container is launched
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570790#comment-14570790 ] 

Sunil G commented on YARN-3754:
-------------------------------

I got the logs from [~bibinchundatt] offline.

{noformat}
2015-05-30 01:11:16,179 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_e313_1432908361253_4506_01_000001 and exit code: 0
java.io.IOException: java.lang.InterruptedException
...
...
2015-05-30 01:11:16,179 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update diagnostics in state store for container_e313_1432908361253_4506_01_000001
java.io.IOException: org.iq80.leveldb.DBException: Closed
	at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostic
{noformat}

When the NM is shutting down, ContainerLaunch is also interrupted. While handling this InterruptedException, the NM tries to update the container diagnostics. But the main thread has already shut down the state store, which causes the DBException: Closed.
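
To make the interleaving concrete, here is a minimal, self-contained sketch of the sequence as I read it from the logs. It does not use the NodeManager code: {{SketchStateStore}}, {{storeIfOpen}} and {{runShutdown}} are hypothetical names invented for illustration, and the guarded write is only one possible way to avoid the exception, not necessarily what any existing patch does.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

final class ShutdownRaceSketch {

    /** Hypothetical stand-in for the NM state store (not NMLeveldbStateStoreService). */
    static final class SketchStateStore {
        private final Map<String, String> db = new HashMap<>();
        private boolean closed;

        synchronized void close() {
            closed = true;
        }

        /** Unguarded write: throws once the store is closed, like the logged failure. */
        synchronized void store(String containerId, String diagnostics) throws IOException {
            if (closed) {
                throw new IOException("Closed");
            }
            db.put(containerId, diagnostics);
        }

        /** Guarded write: the closed check and the put happen under the same lock. */
        synchronized boolean storeIfOpen(String containerId, String diagnostics) {
            if (closed) {
                return false;
            }
            db.put(containerId, diagnostics);
            return true;
        }
    }

    /** Replays the shutdown ordering and reports what the launcher thread observed. */
    static String runShutdown(boolean guarded) {
        SketchStateStore store = new SketchStateStore();
        String[] outcome = new String[1];

        Thread launcher = new Thread(() -> {
            try {
                Thread.sleep(60_000); // stands in for the blocking container launch
                outcome[0] = "launch finished";
            } catch (InterruptedException e) {
                // Interrupt handling tries to record diagnostics for the container.
                try {
                    if (guarded) {
                        boolean stored = store.storeIfOpen("container_1", "interrupted");
                        outcome[0] = stored ? "stored" : "skipped: store closed";
                    } else {
                        store.store("container_1", "interrupted");
                        outcome[0] = "stored";
                    }
                } catch (IOException io) {
                    outcome[0] = "failed: " + io.getMessage();
                }
            }
        });

        launcher.start();
        store.close();        // the shutdown path closes the store first...
        launcher.interrupt(); // ...and then interrupts the in-flight launch
        try {
            launcher.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return outcome[0];
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + runShutdown(false));
        System.out.println("guarded:   " + runShutdown(true));
    }
}
```

In the sketch the unguarded path ends with the "Closed" failure, while the guarded path skips the write. The check and the write share one lock so that a close cannot slip in between them.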

This scenario is already handled in YARN-3641 by [~djp]. [~bibinchundatt], could you please update with this patch and check again? If it is fixed, we can close this ticket as a duplicate. Attaching the NM logs too.


> Race condition when the NodeManager is shutting down and container is launched
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3754
>                 URL: https://issues.apache.org/jira/browse/YARN-3754
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>         Environment: Suse 11 Sp3
>            Reporter: Bibin A Chundatt
>            Assignee: Sunil G
>            Priority: Critical
>
> Container is launched and returned to ContainerImpl
> NodeManager closed the DB connection, which results in {{org.iq80.leveldb.DBException: Closed}}.
> *Attaching the exception trace*
> {code}
> 2015-05-30 02:11:49,122 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Unable to update state store diagnostics for container_e310_1432817693365_3338_01_000002
> java.io.IOException: org.iq80.leveldb.DBException: Closed
>         at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:261)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1109)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ContainerDiagnosticsUpdateTransition.transition(ContainerImpl.java:1101)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:1129)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:246)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.iq80.leveldb.DBException: Closed
>         at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:123)
>         at org.fusesource.leveldbjni.internal.JniDB.put(JniDB.java:106)
>         at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.storeContainerDiagnostics(NMLeveldbStateStoreService.java:259)
>         ... 15 more
> {code}
> We can add a check for whether the DB is closed when we move the container from the ACQUIRED state.
> As per the discussion in YARN-3585, I have added the same.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)