Date: Fri, 7 Nov 2014 15:50:34 +0000 (UTC)
From: "Jason Lowe (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery

    [ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202177#comment-14202177 ]

Jason Lowe commented on YARN-2816:
----------------------------------

bq. It won't cause container leaks, because the container start request is always the first entry stored (startContainerInternal) in the leveldb for each container's records, and it is always the first entry removed (removeContainer).

I don't understand this statement. If the start container request is the first record to be lost, what happens if we write the start container request, launch (or maybe don't launch) the container, and then restart? If we lost the container start record but the container had not completed before the restart, didn't we just lose track of it upon recovery?

In any case, this shouldn't make things much worse than the NM failing to start up when this specific instance of database corruption occurs. We just need to realize that there are _many_ other ways the database could be corrupted, and this only works around one very specific case.

Comments on the patch:

{noformat}
+        LOG.info("Remove container " + containerId +
+            " with incomplete records");
{noformat}

The above needs to be logged at least at the warn level, if not error: we have very likely leaked a container. The code should also do much more than just forget the container. It should look for the pid file, try to kill the process if one is found, and return a recovered container status of killed/lost or something similar; see the sketch at the end of this comment. We shouldn't just pretend the container never existed when returning recovered containers.

{noformat}
-      LOG.info("Creating state database at " + dbfile);
+      LOG.info("Creating state database at " + dbfile, e);
{noformat}

Why was this change made? I don't see the point of logging the exception showing the database didn't exist when we already checked for that condition in this code path.
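To make the suggestion concrete, here is a rough, standalone sketch of the handling I have in mind; the pid-file layout, the helper class, and the KILLED_ON_RECOVERY status are all hypothetical placeholders for illustration, not actual NM APIs:

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only: on finding a container record with no
// start request, warn loudly, best-effort kill the possibly leaked
// process via its pid file, and surface the container as killed/lost.
public class IncompleteRecordHandler {
  // Placeholder for a new recovered-container status value.
  enum RecoveredStatus { KILLED_ON_RECOVERY }

  RecoveredStatus handleIncompleteRecord(String containerId, Path pidDir)
      throws IOException, InterruptedException {
    System.err.println("WARN: container " + containerId
        + " has incomplete state-store records; it may have leaked");
    // Assumed pid-file layout; the real NM derives this per container.
    Path pidFile = pidDir.resolve(containerId + ".pid");
    if (Files.exists(pidFile)) {
      String pid = new String(Files.readAllBytes(pidFile), "UTF-8").trim();
      // Best-effort kill of the leaked process (POSIX only).
      new ProcessBuilder("kill", "-9", pid).inheritIO().start().waitFor();
    }
    // Report the container as killed/lost in the recovered statuses
    // rather than silently dropping it.
    return RecoveredStatus.KILLED_ON_RECOVERY;
  }
}
{code}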
> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2816.000.patch, leveldb_records.txt
>
>
> The NM fails to start with an NPE during container recovery. We saw the following crash:
>
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
>         at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
>
> The cause is that some of the DB files used by NMLeveldbStateStoreService at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state were accidentally deleted to save disk space. This leaves incomplete container records that have no CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the DB. When such a container is recovered in ContainerManagerImpl#recoverContainer, the NullPointerException at the following code shuts down the NM:
> {code}
> StartContainerRequest req = rcs.getStartRequest();
> ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}
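For illustration, a minimal null guard at the recovery site would turn the crash into a handled case (a sketch only; the getContainerId() accessor on the recovered state is assumed here, and whether to skip, kill, or report the container is the question discussed above):

{code}
// Sketch: rcs is the RecoveredContainerState loaded from leveldb.
StartContainerRequest req = rcs.getStartRequest();
if (req == null) {
  // Incomplete record: the start request entry was lost from the DB.
  LOG.warn("Container " + rcs.getContainerId()  // assumed accessor
      + " is missing its start request record; not recovering it");
  return;
}
ContainerLaunchContext launchContext = req.getContainerLaunchContext();
{code}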