hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4549) Containers stuck in KILLING state
Date Thu, 07 Jan 2016 15:10:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087548#comment-15087548

Jason Lowe commented on YARN-4549:

If this only happens to long-running containers and the pid files are missing for even RUNNING
containers that have been up a while then I'm thinking something is coming along at some point
and blowing away the pid files because they're too old.  Is there a tmp cleaner like tmpwatch
or some other periodic maintenance process that could be cleaning up these "old" files?  A
while back someone reported NM recovery issues because they were storing the NM leveldb state
store files in /tmp and a tmp cleaner was periodically deleting some of the old leveldb files
and corrupting the database.
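To see why such a cleaner would only hit long-lived containers, here is a hedged sketch of what a tmpwatch-style cron pass effectively does, demonstrated on a synthetic pid file (the directory, file name, and 7-day threshold are made up for illustration; requires GNU touch):

```shell
# Hypothetical demo of a tmpwatch-style cleaner pass; the pid file name
# and age threshold are illustrative, not taken from any real config.
demo=$(mktemp -d)
touch "$demo/container_0001.pid"
touch -d '10 days ago' "$demo/container_0001.pid"   # backdate: looks "old"
find "$demo" -type f -mtime +7 -delete              # 7-day cleaner sweep
if [ -e "$demo/container_0001.pid" ]; then result=survived; else result=reaped; fi
echo "pid file $result"
rm -rf "$demo"
```

A container up for a few weeks crosses any such age threshold while short-lived containers never do, which matches the reported pattern.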

You could also look in other areas under nmPrivate and see if some of the distributed cache
directories have also been removed.  If that's the case then you should see messages like
"Resource XXX is missing, localizing it again" in the NM logs as it tries to re-use a distcache
entry but then discovers it's mysteriously missing from the local disk.  If whole directories
have been reaped, including the dist cache entries, then it would strongly point to a periodic
cleanup process like tmpwatch or something similar.
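The log check described above could be scripted roughly as follows; here it runs against a synthetic log line for illustration, so the log path, logger name, and resource URI are assumptions — point the grep at the real NM log files in practice:

```shell
# Hypothetical scan for the "is missing, localizing it again" message
# quoted in the comment; the sample log line below is fabricated.
log=$(mktemp)
echo '16/01/05 05:20:01 INFO localizer.LocalResourcesTrackerImpl: Resource hdfs://nn/app/job.jar is missing, localizing it again' > "$log"
hits=$(grep -c 'is missing, localizing it again' "$log")
echo "re-localization events: $hits"
rm -f "$log"
```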

> Containers stuck in KILLING state
> ---------------------------------
>                 Key: YARN-4549
>                 URL: https://issues.apache.org/jira/browse/YARN-4549
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
> We are running Samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the container-executor
with cgroups configuration. We also have NM recovery enabled.
> We observe a lot of containers that get stuck in the KILLING state after the NM tries
to kill them. The container remains running indefinitely, which causes some duplication as
new containers are brought up to replace them. Looking through the logs, the NM can't seem
to get the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping container with
container Id: container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user IP=    
   OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS
 APPID=application_1448454866800_0023    CONTAINERID=container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container container_1448454866800_0023_01_000005
transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container container_1448454866800_0023_01_000005
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for container_1448454866800_0023_01_000005.
Waited for 2000 ms.
> {noformat}
> The PID files for containers in the KILLING state are missing, and a few other containers
that have been in the RUNNING state for a few weeks are also missing them.  We weren't able
to consistently replicate this and are hoping that someone has come across this before.
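One way to quantify the reporter's observation is to walk nmPrivate and list container directories lacking a pid file. This is a hedged sketch run on a synthetic layout; the `application_*/container_*/<containerid>.pid` structure shown is an assumption based on default NM local-dir conventions and may differ per cluster:

```shell
# Hypothetical cross-check for container dirs with no pid file; the
# directory tree is fabricated here -- point NM_PRIVATE at the real
# nmPrivate directory under yarn.nodemanager.local-dirs in practice.
NM_PRIVATE=$(mktemp -d)
mkdir -p "$NM_PRIVATE/application_1/container_1" "$NM_PRIVATE/application_1/container_2"
touch "$NM_PRIVATE/application_1/container_1/container_1.pid"   # container_2 has none
missing=0
for dir in "$NM_PRIVATE"/application_*/container_*; do
    ls "$dir"/*.pid >/dev/null 2>&1 || { echo "no pid file: $dir"; missing=$((missing+1)); }
done
echo "containers missing pid files: $missing"
rm -rf "$NM_PRIVATE"
```

Checking the mtimes of the pid files that do survive against the container start times would further support (or rule out) the age-based-cleaner theory.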

This message was sent by Atlassian JIRA
