hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2816) NM fail to start with NPE during container recovery
Date Fri, 07 Nov 2014 15:50:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202177#comment-14202177 ]

Jason Lowe commented on YARN-2816:

bq.  It won't cause containers leaks. Because container start request is always the first
entry to store(startContainerInternal) in the levelDB for each container records and it is
always the first entry to remove (removeContainer) in the levelDB for each container records.

I don't understand this statement.  If the start container request is the first record to
be lost, what happens if we write the start container request, launch (or maybe don't launch)
the container, then restart?  If we lost the container start record but the container had
not completed before the restart, didn't we just lose track of it upon recovery?

Anyway, this shouldn't make things much worse than the NM failing to start up if this specific
instance of database corruption occurs.  I just think we need to realize that there are _many_
other ways the database could be corrupted and this only works around a very specific instance
of it.

Comments on the patch:

+      LOG.info("Remove container " + containerId +
+          " with incomplete records");

The above needs to be logged at least at the warn level if not error.  We have very likely
leaked a container.  Also, the code should do much more than just forget the container.  It
should instead look for the pid file, try to kill the container process if the file is found,
and return a recovered container status of killed/lost or something similar.  We shouldn't
just pretend the container didn't exist when returning recovered containers.
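
As a rough illustration of the handling described above, here is a minimal Java sketch. It is
not NodeManager code: the class IncompleteContainerRecovery, the RecoveredStatus enum, and the
methods parsePid, readPid and recoverIncomplete are all hypothetical names. It also assumes the
pid file holds a single decimal pid, and it uses ProcessHandle (Java 9+) as a stand-in for
whatever signalling the NM's container executor would actually perform.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical sketch; none of these names exist in the NodeManager code base.
final class IncompleteContainerRecovery {
  private static final Logger LOG =
      Logger.getLogger(IncompleteContainerRecovery.class.getName());

  // Status reported for a container whose stored records are incomplete.
  enum RecoveredStatus { LOST_NO_PID_FILE, KILLED, ALREADY_EXITED }

  // Parses the text of a pid file; empty if it is not a positive number.
  static Optional<Long> parsePid(String text) {
    try {
      long pid = Long.parseLong(text.trim());
      if (pid > 0) {
        return Optional.of(pid);
      }
      return Optional.empty();
    } catch (NumberFormatException e) {
      return Optional.empty();
    }
  }

  // Reads the pid file if it exists; empty if missing or unreadable.
  static Optional<Long> readPid(Path pidFile) {
    if (!Files.isRegularFile(pidFile)) {
      return Optional.empty();
    }
    try {
      byte[] bytes = Files.readAllBytes(pidFile);
      return parsePid(new String(bytes, StandardCharsets.UTF_8));
    } catch (IOException e) {
      return Optional.empty();
    }
  }

  // Instead of silently forgetting the container, warn, try to kill any
  // surviving process, and report a status the caller can hand back.
  static RecoveredStatus recoverIncomplete(String containerId, Path pidFile) {
    LOG.warning("Container " + containerId
        + " has incomplete records; it may have been leaked");
    Optional<Long> pid = readPid(pidFile);
    if (!pid.isPresent()) {
      return RecoveredStatus.LOST_NO_PID_FILE;
    }
    Optional<ProcessHandle> proc = ProcessHandle.of(pid.get());
    if (proc.isPresent() && proc.get().isAlive()) {
      // Only requests termination; real code would confirm the exit.
      proc.get().destroyForcibly();
      return RecoveredStatus.KILLED;
    }
    return RecoveredStatus.ALREADY_EXITED;
  }

  public static void main(String[] args) {
    // A pid file path that is assumed not to exist on this machine.
    Path pidFile = Paths.get("no-such-dir", "container_example.pid");
    System.out.println(recoverIncomplete("container_example", pidFile));
  }
}
```

One caveat with any approach like this: a stale pid file may name a pid that has since been
reused by an unrelated process, so real code would need some way to confirm the process
actually belongs to the container before killing it.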

-        LOG.info("Creating state database at " + dbfile);
+        LOG.info("Creating state database at " + dbfile, e);

Why was this change made?  I don't see the point of logging the exception showing the database
didn't exist when we already checked for that condition in this code path.

> NM fail to start with NPE during container recovery
> ---------------------------------------------------
>                 Key: YARN-2816
>                 URL: https://issues.apache.org/jira/browse/YARN-2816
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-2816.000.patch, leveldb_records.txt
> NM fail to start with NPE during container recovery.
> We saw the following crash happen:
> 2014-10-30 22:22:37,211 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl failed in state INITED; cause: java.lang.NullPointerException
> java.lang.NullPointerException
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recoverContainer(ContainerManagerImpl.java:289)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.recover(ContainerManagerImpl.java:252)
> 	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:235)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:250)
> 	at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:445)
> 	at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:492)
> The reason is that some DB files used in NMLeveldbStateStoreService were accidentally deleted
> to save disk space at /tmp/hadoop-yarn/yarn-nm-recovery/yarn-nm-state. This leaves some incomplete
> container records which don't have a CONTAINER_REQUEST_KEY_SUFFIX (startRequest) entry in the
> DB. When such a container is recovered at ContainerManagerImpl#recoverContainer,
> the NullPointerException at the following code causes the NM to shut down.
> {code}
>     StartContainerRequest req = rcs.getStartRequest();
>     ContainerLaunchContext launchContext = req.getContainerLaunchContext();
> {code}
