hadoop-yarn-issues mailing list archives

From "Shane Kumpf (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4759) Revisit signalContainer() for docker containers
Date Wed, 23 Mar 2016 15:35:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15208603#comment-15208603 ]

Shane Kumpf commented on YARN-4759:

We need to use docker client commands, rather than the OS kill command, to signal processes in containers.

docker stop sends SIGTERM to PID 1 and waits for the process to stop (10 seconds by default,
configurable); if the container has not stopped by the end of the timeout, SIGKILL is sent.
docker kill, OTOH, has no delay and simply sends a signal to PID 1 of the container (SIGKILL by default,
configurable).
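To make the difference concrete, here is a simplified Java model of which signals PID 1 ends up receiving from each command. It is a sketch, not Docker's implementation: the class and method names are hypothetical, and it assumes the image does not override the stop signal with a STOPSIGNAL instruction.

```java
import java.util.List;

// Hypothetical model, not Docker's code: which signals reach PID 1 of a container.
// Assumes the image does not set STOPSIGNAL, so docker stop starts with SIGTERM.
final class DockerSignalSemantics {
    static final int DEFAULT_STOP_TIMEOUT_SECONDS = 10;

    // docker stop: SIGTERM first, SIGKILL only if PID 1 is still alive when the timeout expires.
    static List<String> stop(int timeoutSeconds, int secondsToExitAfterTerm) {
        if (secondsToExitAfterTerm < timeoutSeconds) {
            return List.of("SIGTERM");
        }
        return List.of("SIGTERM", "SIGKILL");
    }

    // docker kill: exactly one signal and no waiting; SIGKILL unless --signal overrides it.
    static List<String> kill(String signalOverride) {
        return List.of(signalOverride == null ? "SIGKILL" : signalOverride);
    }
}
```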

Signals that invoke graceful shutdown vary between processes. For instance, to gracefully shut down
nginx (allowing outstanding requests to finish), SIGQUIT should be sent. For Apache HTTPD,
SIGWINCH is used for graceful shutdown.
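One way to picture a user-configurable graceful-stop signal is a per-process lookup with SIGTERM as the fallback. This is only an illustration: the class, the map, and the process names used as keys are assumptions (for example, Apache HTTPD's binary is called apache2 on some distributions).

```java
import java.util.Map;

// Hypothetical lookup: which signal asks a given server process to shut down gracefully.
// Processes not listed fall back to SIGTERM.
final class GracefulStopSignals {
    private static final Map<String, String> BY_PROCESS = Map.of(
            "nginx", "SIGQUIT",   // nginx: let outstanding requests finish, then exit
            "httpd", "SIGWINCH"); // Apache HTTPD: graceful stop

    static String forProcess(String processName) {
        return BY_PROCESS.getOrDefault(processName, "SIGTERM");
    }
}
```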

To complicate matters, the docker client sends signals to PID 1 in the container, so depending
on whether the exec form is used for CMD in the Dockerfile, the process we want to signal
may be a subprocess of the shell running as PID 1. Users who require specific signals
will need to understand this limitation.
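The exec-form versus shell-form distinction can be modelled as "what argv becomes PID 1". The sketch below assumes no ENTRYPOINT and the default Linux SHELL of ["/bin/sh", "-c"]; the class and method names are hypothetical. It is deliberately conservative: some shells exec a lone simple command, which this model ignores.

```java
import java.util.List;

// Hypothetical model of the argv Docker starts as PID 1 for a Dockerfile CMD.
// Assumes no ENTRYPOINT and the default Linux SHELL ["/bin/sh", "-c"].
final class Pid1 {
    // Exec form, e.g. CMD ["nginx", "-g", "daemon off;"]: the program itself is PID 1.
    static List<String> forExecForm(List<String> cmd) {
        return List.copyOf(cmd);
    }

    // Shell form, e.g. CMD nginx -g 'daemon off;': the shell is PID 1 and the program is its child.
    static List<String> forShellForm(String cmd) {
        return List.of("/bin/sh", "-c", cmd);
    }

    // True when the declared PID 1 is the application, so a docker kill signal reaches it directly.
    static boolean appIsPid1(List<String> pid1Argv, String appBinary) {
        return pid1Argv.get(0).equals(appBinary);
    }
}
```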

We should allow for user configurable signals and timeouts. There are a couple of approaches
to achieve this:

1) Only use docker kill, and sleep in Java code. docker kill accepts the --signal argument,
but does not support a wait timeout. The flow would be: send the signal, then sleep for 10 seconds
by default or for the user-supplied sleep value.
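Approach 1 could look roughly like this. The command runner and sleeper are injected so the flow can be exercised without a Docker daemon; all names here are hypothetical, and real code would wrap ProcessBuilder and Thread.sleep behind those interfaces.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;

// Sketch of approach 1 (hypothetical names): signal with `docker kill --signal`,
// then wait in Java, because docker kill itself has no timeout.
final class KillThenSleep {
    static final long DEFAULT_SLEEP_MILLIS = 10_000L;

    private final Consumer<List<String>> runCommand; // real code: run argv via ProcessBuilder
    private final LongConsumer sleeper;              // real code: wrap Thread.sleep

    KillThenSleep(Consumer<List<String>> runCommand, LongConsumer sleeper) {
        this.runCommand = runCommand;
        this.sleeper = sleeper;
    }

    // Sends the signal to PID 1 of the container, then sleeps for the given time.
    void signalAndWait(String containerId, String signal, long sleepMillis) {
        runCommand.accept(List.of("docker", "kill", "--signal", signal, containerId));
        sleeper.accept(sleepMillis);
    }
}
```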

2) Use docker stop if the user has not specified a signal, with the default timeout of 10 seconds
or the user-supplied timeout. Use docker kill if the user supplies a signal.
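Approach 2 reduces to choosing the command from what the user configured. A minimal sketch, with a hypothetical helper name; it only builds the argv and does not run it.

```java
import java.util.List;
import java.util.Optional;
import java.util.OptionalInt;

// Sketch of approach 2 (hypothetical helper): docker stop when no signal is configured,
// docker kill --signal when one is.
final class StopOrKill {
    static final int DEFAULT_TIMEOUT_SECONDS = 10;

    static List<String> command(String containerId, Optional<String> userSignal, OptionalInt userTimeoutSeconds) {
        if (userSignal.isPresent()) {
            // docker kill delivers the signal immediately and does not wait
            return List.of("docker", "kill", "--signal", userSignal.get(), containerId);
        }
        // docker stop sends SIGTERM, waits up to --time seconds, then sends SIGKILL
        int timeout = userTimeoutSeconds.orElse(DEFAULT_TIMEOUT_SECONDS);
        return List.of("docker", "stop", "--time", Integer.toString(timeout), containerId);
    }
}
```

Note that in this approach a user-supplied timeout has no effect on the docker kill path.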

The default behavior should be to send SIGTERM, sleep 10 seconds, and, if the container is still
running, send SIGKILL. Both the signal and the timeout should be configurable.
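The proposed default, with an escalation step, might be sketched as follows. The names are hypothetical; the "still running" check is injected and could, in real code, parse the output of `docker inspect --format '{{.State.Running}}' <container>`.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongConsumer;
import java.util.function.Predicate;

// Sketch of the proposed default (hypothetical names): send a configurable signal,
// wait a configurable time, and send SIGKILL only if the container is still running.
final class GracefulThenKill {
    static final String DEFAULT_SIGNAL = "SIGTERM";
    static final long DEFAULT_WAIT_MILLIS = 10_000L;

    private final Consumer<List<String>> runCommand; // real code: run argv via ProcessBuilder
    private final LongConsumer sleeper;              // real code: wrap Thread.sleep
    private final Predicate<String> isRunning;       // real code: e.g. backed by docker inspect

    GracefulThenKill(Consumer<List<String>> runCommand, LongConsumer sleeper, Predicate<String> isRunning) {
        this.runCommand = runCommand;
        this.sleeper = sleeper;
        this.isRunning = isRunning;
    }

    // Returns true if the container had to be force-killed.
    boolean stop(String containerId, String signal, long waitMillis) {
        runCommand.accept(List.of("docker", "kill", "--signal", signal, containerId));
        sleeper.accept(waitMillis);
        if (!isRunning.test(containerId)) {
            return false; // exited within the grace period
        }
        runCommand.accept(List.of("docker", "kill", "--signal", "SIGKILL", containerId));
        return true;
    }
}
```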

How the above impacts NM reacquisition is yet to be determined, but it may make sense to make
this an umbrella issue and split out the required changes.

/cc [~sidharta-s] - thoughts on the above?

> Revisit signalContainer() for docker containers
> -----------------------------------------------
>                 Key: YARN-4759
>                 URL: https://issues.apache.org/jira/browse/YARN-4759
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Sidharta Seethana
>            Assignee: Shane Kumpf
> The current signal handling (in the DockerContainerRuntime) needs to be revisited for
> docker containers. For example, container reacquisition on NM restart might not work, depending
> on which user the process in the container runs as.

This message was sent by Atlassian JIRA
