hadoop-yarn-issues mailing list archives

From "Shane Kumpf (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5366) Improve handling of the Docker container life cycle
Date Thu, 09 Nov 2017 13:13:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245605#comment-16245605 ]

Shane Kumpf commented on YARN-5366:
-----------------------------------

{quote}
However, why is it that we're running docker containers in detached mode anyway?
{quote}

Hey [~ebadger] - Sorry for the long delay. That is a great question. I've spent some time
testing and I do see the benefits in that the {{--rm}} argument simplifies the lifecycle for
most containers, but here are a few of the items I ran into.\\
\\
* Without detached mode, stdin gets attached to the process in the container. This results
in container-executor waiting for the docker run command to complete before performing the
remaining tasks, such as getting the PID and setting up the cgroup tasks file. Since the container
has already finished when c-e resumes, it's not possible to get the PID. Without the PID file,
ContainersMonitor, container signaling, etc. fail. Taking the default behavior would require
non-trivial changes to container-executor, beyond swapping the flags.
* It would be nice to have the ability to keep a container around for debugging and clean
it up after the debug delay has passed. This is what YARN-5366 really started out to do.
* The {{\-\-rm}} flag doesn't always work. I've seen containers get stuck in a "dead" state due
to mount namespace leaks, and {{\-\-rm}} left them abandoned. Explicitly managing the lifecycle
still has these challenges, but we can control removal retries if we encounter an issue.
* It looks like named volumes were added for GPU support. I haven't spent the time to understand
exactly how they are used, but {{\-\-rm}} will remove volumes in some cases, so we'd need to
better understand their use to avoid inadvertently deleting data.
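
To illustrate the first bullet, here is a rough sketch of the detached flow: start the container with {{-d}} so the call returns right away, then ask Docker for the PID. This is not the container-executor code (which is C); {{launch_detached}}, {{run_cli}} and the {{runner}} parameter are hypothetical names used only for illustration. Only the {{docker run -d --name}} and {{docker inspect --format}} invocations are real CLI usage.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the command's stdout.
# It is injectable so the flow can be exercised without a Docker daemon.
Runner = Callable[[List[str]], str]


def run_cli(argv: List[str]) -> str:
    """Run a command, raise on a non-zero exit, and return stripped stdout."""
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def launch_detached(name: str, image: str, command: Sequence[str],
                    runner: Runner = run_cli) -> int:
    """Start a container with -d and return the PID Docker reports for it."""
    # -d makes `docker run` return as soon as the container has started,
    # so the caller is free to carry on with PID and cgroup setup.
    runner(["docker", "run", "-d", "--name", name, image, *command])
    # State.Pid is the host PID of the container's main process. Docker
    # reports 0 when the container is not running, e.g. it already exited.
    pid = int(runner(["docker", "inspect", "--format", "{{.State.Pid}}", name]))
    if pid == 0:
        raise RuntimeError(f"container {name} exited before its PID could be read")
    return pid
```

The {{pid == 0}} branch is the short-lived container case: even in detached mode the process can be gone before the PID is read, so the caller still has to handle that path.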

Given these, I think managing the lifecycle explicitly might make more sense. We'd need to
do validation anyway if we went the {{\-\-rm}} route, to make sure the cleanup was successful.
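
As a rough sketch of what explicit removal with retries and validation could look like (again, not the actual patch; {{remove_with_retries}}, {{exit_code}} and their parameters are hypothetical, and the attempt count and delay are arbitrary): issue {{docker rm}}, then confirm the container is gone with {{docker inspect}}, and retry if it is still there.

```python
import subprocess
import time
from typing import Callable, List

# A runner takes an argv list and returns the command's exit code.
Runner = Callable[[List[str]], int]


def exit_code(argv: List[str]) -> int:
    """Run a command quietly and return its exit code."""
    return subprocess.run(argv, stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode


def remove_with_retries(name: str, runner: Runner = exit_code,
                        attempts: int = 3, delay_s: float = 1.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to remove a container; return True once it is verifiably gone."""
    for attempt in range(attempts):
        # The exit code of `docker rm` is ignored on purpose: the check
        # below decides whether the removal actually took effect.
        runner(["docker", "rm", name])
        # `docker inspect` exits non-zero when no such container exists.
        # Assumption: the daemon is reachable, since an unreachable daemon
        # would also produce a non-zero exit.
        if runner(["docker", "inspect", name]) != 0:
            return True
        if attempt + 1 < attempts:
            sleep(delay_s)
    return False
```

A {{False}} return corresponds to the stuck "dead" container case above, which the caller can then log or escalate instead of silently leaking the container.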

> Improve handling of the Docker container life cycle
> ---------------------------------------------------
>
>                 Key: YARN-5366
>                 URL: https://issues.apache.org/jira/browse/YARN-5366
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Shane Kumpf
>            Assignee: Shane Kumpf
>              Labels: oct16-medium
>         Attachments: YARN-5366.001.patch, YARN-5366.002.patch, YARN-5366.003.patch, YARN-5366.004.patch, YARN-5366.005.patch, YARN-5366.006.patch
>
>
> There are several paths that need to be improved with regard to the Docker container lifecycle when running Docker containers on YARN.
> 1) Provide the ability to keep a container on the NodeManager for a set period of time for debugging purposes.
> 2) Support sending signals to the process in the container to allow for triggering stack traces, heap dumps, etc.
> 3) Support for Docker's live restore, which means moving away from the use of {{docker wait}}. (YARN-5818)
> 4) Improve the resiliency of liveliness checks (kill -0) by adding retries.
> 5) Improve the resiliency of container removal by adding retries.
> 6) Only attempt to stop, kill, and remove containers if the current container state allows for it.
> 7) Better handling of short lived containers when the container is stopped before the PID can be retrieved. (YARN-6305)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
