hadoop-yarn-issues mailing list archives

From "Shane Kumpf (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8206) Sending a kill does not immediately kill docker containers
Date Wed, 02 May 2018 18:27:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461463#comment-16461463 ]

Shane Kumpf commented on YARN-8206:
-----------------------------------

Thanks for the patch, [~ebadger]! I've been able to validate that it is working as intended.
A few items to address:
 # {{TestDockerContainerRuntime$MockRuntime#signalContainer}} needs to be updated to incorporate
the change in logic. {{TestDockerContainerRuntime#testDockerStopOnKillSignalWhenRunning}}
will need to change as well.
 # Nit: {{privilegedOperationExecutor}} could be used in {{DockerLinuxContainerRuntime#handleContainerKill}}.
 # Based on the comment in the catch block for {{DockerLinuxContainerRuntime#handleContainerKill}},
I think the {{privOp}} should have failure logging disabled via {{PrivilegedOperation#disableFailureLogging}}.
 # Along those same lines, the current code will result in a warning in the NM log when a
container completes prior to the kill. I think {{signalContainer}} will need logic similar
to {{LinuxContainerExecutor#handleExitCode}} to suppress the warning in this case, where
container-executor (c-e) is returning INVALID_CONTAINER_PID (rc=9). Alternatively, it may
make sense to remove this warning altogether. Here is the warning:
{code:java}
2018-05-02 17:42:35,363 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1525282503184_0001_01_000002 transitioned from RUNNING to EXITED_WITH_SUCCESS
2018-05-02 17:42:35,363 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1525282503184_0001_01_000002
2018-05-02 17:42:35,700 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: Signal docker container failed. Exception:
org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Signal container failed
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.handleContainerKill(DockerLinuxContainerRuntime.java:1223)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.signalContainer(DockerLinuxContainerRuntime.java:995)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.signalContainer(DelegatingLinuxContainerRuntime.java:159)
	at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:755)
	at org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor$DelayedProcessKiller.run(ContainerExecutor.java:854)
{code}
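To make the last item concrete, here is a rough, self-contained sketch of the kind of check I have in mind. Everything in it is hypothetical and stands in for the real classes: {{KillWarningSketch}}, {{SignalFailedException}} and {{KillOperation}} are made-up names, and the only value taken from the observation above is that c-e returns INVALID_CONTAINER_PID with rc=9. It is not a copy of {{LinuxContainerExecutor#handleExitCode}}, only an illustration of downgrading the "container already exited" case so it does not reach the log as a warning.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are the real YARN classes. */
final class KillWarningSketch {

  // Assumed to match the container-executor return code observed above (rc=9).
  static final int INVALID_CONTAINER_PID = 9;

  /** Stand-in for an exception that carries the container-executor exit code. */
  static final class SignalFailedException extends Exception {
    final int exitCode;

    SignalFailedException(String message, int exitCode) {
      super(message);
      this.exitCode = exitCode;
    }
  }

  /** Stand-in for the privileged operation that delivers the kill. */
  interface KillOperation {
    void run() throws SignalFailedException;
  }

  // Messages are collected in lists instead of a real logger so they can be inspected.
  final List<String> warnings = new ArrayList<>();
  final List<String> debugMessages = new ArrayList<>();

  /** Runs the kill and records a warning only for unexpected failures. */
  void signalContainer(String containerId, KillOperation op) {
    try {
      op.run();
    } catch (SignalFailedException e) {
      if (e.exitCode == INVALID_CONTAINER_PID) {
        // The container finished before the kill arrived; not worth a warning.
        debugMessages.add("Container " + containerId + " exited before the kill");
      } else {
        warnings.add("Signal docker container failed: " + e.getMessage());
      }
    }
  }

  public static void main(String[] args) {
    KillWarningSketch runtime = new KillWarningSketch();
    runtime.signalContainer("container_A", () -> {
      throw new SignalFailedException("Signal container failed", INVALID_CONTAINER_PID);
    });
    runtime.signalContainer("container_B", () -> {
      throw new SignalFailedException("Signal container failed", 1);
    });
    System.out.println("warnings=" + runtime.warnings.size()
        + " debug=" + runtime.debugMessages.size());
  }
}
```

Removing the warning altogether, as suggested above, would amount to dropping the {{else}} branch; the sketch keeps it so that other failures stay visible.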

 

> Sending a kill does not immediately kill docker containers
> ----------------------------------------------------------
>
>                 Key: YARN-8206
>                 URL: https://issues.apache.org/jira/browse/YARN-8206
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Eric Badger
>            Assignee: Eric Badger
>            Priority: Major
>         Attachments: YARN-8206.001.patch, YARN-8206.002.patch
>
>
> {noformat}
>         if (ContainerExecutor.Signal.KILL.equals(signal)
>             || ContainerExecutor.Signal.TERM.equals(signal)) {
>           handleContainerStop(containerId, env);
> {noformat}
> Currently in the code, we are handling both SIGKILL and SIGTERM as equivalent for docker
> containers. However, they should actually be separate. When YARN sends a SIGKILL to a process,
> it means for it to die immediately and not sit around waiting for anything. This ensures an
> immediate reclamation of resources. Additionally, if a SIGTERM is sent before the SIGKILL,
> the task might not handle the signal correctly, and will then end up as a failed task instead
> of a killed task. This is especially bad for preemption.
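
The separation described in the issue can be pictured with a small stand-alone sketch. The {{Signal}} and {{Action}} enums and the {{dispatch}} method below are hypothetical and only model the decision; the two action names echo {{handleContainerKill}} and {{handleContainerStop}} from the discussion, but this is not the patch itself.

```java
/** Hypothetical sketch of routing KILL and TERM to different handlers. */
final class SignalDispatchSketch {

  /** Simplified stand-in for the signals the NodeManager may send. */
  enum Signal { NULL, QUIT, KILL, TERM }

  /** What the runtime would do for a given signal in this sketch. */
  enum Action { KILL_IMMEDIATELY, STOP_GRACEFULLY, FORWARD_SIGNAL }

  static Action dispatch(Signal signal) {
    if (Signal.KILL.equals(signal)) {
      // SIGKILL: terminate right away so resources are reclaimed immediately
      // (the role handleContainerKill plays in the discussion).
      return Action.KILL_IMMEDIATELY;
    }
    if (Signal.TERM.equals(signal)) {
      // SIGTERM: give the container a chance to shut down
      // (the role handleContainerStop plays in the discussion).
      return Action.STOP_GRACEFULLY;
    }
    // Any other signal is passed through unchanged in this sketch.
    return Action.FORWARD_SIGNAL;
  }

  public static void main(String[] args) {
    for (Signal s : Signal.values()) {
      System.out.println(s + " -> " + dispatch(s));
    }
  }
}
```

With the original condition, both KILL and TERM would map to the graceful stop; the sketch differs only in giving KILL its own branch.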


