hadoop-yarn-issues mailing list archives

From "Danil Serdyuchenko (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4549) Containers stuck in KILLING state
Date Thu, 07 Jan 2016 11:14:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087217#comment-15087217 ]

Danil Serdyuchenko commented on YARN-4549:

We did some more digging and found that a few containers currently in the RUNNING state
are missing directories under the {{nmPrivate}} dir. The web interface reports that the containers
are running on that node, and the container processes are there too, but the entire
application dir under {{nmPrivate}} is missing.

[~jlowe] This usually happens to long-running containers. The PID files are missing for containers
in the KILLING state, and for certain RUNNING containers. The PID file should be under {{nm-local-dir}};
for us that is: {{/tmp/hadoop-ec2-user/nm-local-dir/nmPrivate/<application_id>/<container_id>/<container_id>.pid}}.
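A quick way to spot affected containers on a node is to scan {{nmPrivate}} for container dirs that lack their PID file. This is a hypothetical diagnostic sketch (not part of the report); the function name and output format are mine, and it assumes the {{<application_id>/<container_id>/<container_id>.pid}} layout described above.

```shell
#!/bin/sh
# Sketch: report container dirs under an nmPrivate dir that are missing
# their <container_id>.pid file. Pass the nmPrivate path as the argument.
list_missing_pids() {
  nmpriv="$1"
  for cdir in "$nmpriv"/application_*/container_*; do
    # Skip the unexpanded glob when no container dirs exist.
    [ -d "$cdir" ] || continue
    cid=$(basename "$cdir")
    # The NM expects <container_id>.pid inside the container dir.
    [ -f "$cdir/$cid.pid" ] || echo "missing pid file: $cid"
  done
}

# Example, using the path from this report:
# list_missing_pids /tmp/hadoop-ec2-user/nm-local-dir/nmPrivate
```

Running this against the node's {{nm-local-dir}} should show which RUNNING containers the NM will later fail to kill for lack of a PID file.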

> Containers stuck in KILLING state
> ---------------------------------
>                 Key: YARN-4549
>                 URL: https://issues.apache.org/jira/browse/YARN-4549
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
> We are running samza 0.8 on YARN 2.7.1 with {{LinuxContainerExecutor}} as the container-executor
with cgroups configuration. Also we have NM recovery enabled.
> We observe a lot of containers that get stuck in the KILLING state after the NM tries
to kill them. The containers remain running indefinitely, which causes some duplication as
new containers are brought up to replace them. Looking through the logs, the NM can't seem to get
the container PID.
> {noformat}
> 16/01/05 05:16:44 INFO containermanager.ContainerManagerImpl: Stopping container with
container Id: container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO nodemanager.NMAuditLogger: USER=ec2-user IP=    
   OPERATION=Stop Container Request        TARGET=ContainerManageImpl      RESULT=SUCCESS
 APPID=application_1448454866800_0023    CONTAINERID=container_1448454866800_0023_01_000005
> 16/01/05 05:16:44 INFO container.ContainerImpl: Container container_1448454866800_0023_01_000005
transitioned from RUNNING to KILLING
> 16/01/05 05:16:44 INFO launcher.ContainerLaunch: Cleaning up container container_1448454866800_0023_01_000005
> 16/01/05 05:16:47 INFO launcher.ContainerLaunch: Could not get pid for container_1448454866800_0023_01_000005.
Waited for 2000 ms.
> {noformat}
> The PID files for each container seem to be present on the node. We weren't able to consistently
replicate this and are hoping that someone has come across this before.

This message was sent by Atlassian JIRA
