hadoop-yarn-issues mailing list archives

From "Eric Badger (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5366) Improve handling of the Docker container life cycle
Date Thu, 09 Nov 2017 16:51:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16246006#comment-16246006 ]

Eric Badger commented on YARN-5366:

Hey [~shanekumpf@gmail.com], thanks for the response. I have a few follow-up questions.

> Without detached mode, stdin gets attached to the process in the container. This results in
> container-executor waiting for the docker run command to complete before performing the remaining
> tasks, like getting the pid, setting up the cgroup tasks file. Since the container is already
> finished when c-e resumes, it's not possible to get the PID. Without the PID file, ContainersMonitor,
> container signaling, etc fails. This would require non-trivial changes to container-executor,
> beyond swapping the flags if we take the default behavior.

Can't we run the docker run command in the background? Wouldn't that fix this issue and allow
us to continue along in the container-executor as planned?
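
To illustrate the background-launch idea, here is a minimal Python sketch. It is not container-executor code: the helper name `launch_in_background` is hypothetical, and a short-lived Python child stands in for a foreground `docker run` (no `-d`). It only shows that a launcher can start a foreground command without blocking and still have a PID in hand while the command runs. One caveat: the PID obtained this way belongs to the launched client process, not to the process inside the container. With Docker, the container's PID would still have to be queried separately, for example with `docker inspect --format '{{.State.Pid}}' <container>`.

```python
import subprocess
import sys


def launch_in_background(argv):
    """Start argv without waiting for it to finish (hypothetical helper).

    stdin is detached so the child cannot block on the caller's terminal.
    """
    return subprocess.Popen(argv, stdin=subprocess.DEVNULL)


# Stand-in for a foreground "docker run ..." command; an assumption made
# purely so the sketch runs without a Docker daemon.
child = launch_in_background(
    [sys.executable, "-c", "import time; time.sleep(0.5)"]
)

# The PID is available immediately, while the child is still running,
# so follow-up work (PID file, cgroup setup) could proceed here.
pid_while_running = child.pid

# Reap the child once the follow-up work is done.
exit_code = child.wait()
```
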

> It would be nice to have the ability to keep a container around for debugging and clean it
> up after the debug delay has passed. This is what YARN-5366 really started out to do.

Yeah, I agree that this would be really useful. This is actually one of the reasons that I'm
looking into this detached mode issue.

> The --rm doesn't always work. I've seen containers get stuck in a "dead" state due to mount
> namespace leaks and --rm left it abandoned. Explicitly managing the lifecycle still has these
> challenges, but we can control removal retries if we encounter an issue.

Can you point to some examples of --rm not working? Are there issues that have been reported
to the Docker community that we can follow? Has this been fixed in a newer release of Docker?
It makes sense that we could retry removals in the container-executor if we control removal
explicitly, but then we're just moving the surface area of a possible bug from the Docker
daemon to the container-executor. If the container-executor fails for whatever reason, then
the container will stick around, just as it would if the Docker daemon fails to remove it.
I would generally expect the Docker daemon to be better at removing containers, since
this is a Docker issue, not a Hadoop issue.
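
For reference, the explicit removal-with-retries approach discussed above could look roughly like the following Python sketch. Everything here is illustrative: `remove_with_retries` and `fake_remove` are hypothetical names, and the removal step is passed in as a callable so the example runs without Docker. In practice that callable might wrap a `docker rm <container>` invocation and report whether it succeeded.

```python
import time


def remove_with_retries(remove_once, attempts=3, delay_seconds=0.0):
    """Call remove_once() until it returns True or attempts run out.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed (the container would then be left behind).
    """
    for attempt in range(1, attempts + 1):
        if remove_once():
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)
    return None


# Fake removal that fails twice before succeeding, standing in for a
# container that is briefly stuck and cannot be removed right away.
calls = []


def fake_remove():
    calls.append(1)
    return len(calls) >= 3


succeeded_on = remove_with_retries(fake_remove, attempts=5)
gave_up = remove_with_retries(lambda: False, attempts=2)
```

The sketch also shows the limit raised above: if the process running the retry loop dies, nothing is left to retry, so the container stays around either way.
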

> It looks like for GPU support, named volumes were added. I haven't spent the time to understand
> exactly how it is used, but --rm will remove volumes in some cases, so we'd need to better
> understand the use to avoid inadvertently deleting data.

Hmm. This is a more interesting issue. I don't know how named volumes and --rm work together.
Hopefully there's a way that they can work together.

Overall, if we can get it to work, I think using the --rm option simplifies a lot of things
for us and reduces the possibility for bugs to be introduced in the container-executor. 

> Improve handling of the Docker container life cycle
> ---------------------------------------------------
>                 Key: YARN-5366
>                 URL: https://issues.apache.org/jira/browse/YARN-5366
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Shane Kumpf
>            Assignee: Shane Kumpf
>              Labels: oct16-medium
>         Attachments: YARN-5366.001.patch, YARN-5366.002.patch, YARN-5366.003.patch, YARN-5366.004.patch, YARN-5366.005.patch, YARN-5366.006.patch
>
> There are several paths that need to be improved with regard to the Docker container lifecycle when running Docker containers on YARN.
> 1) Provide the ability to keep a container on the NodeManager for a set period of time for debugging purposes.
> 2) Support sending signals to the process in the container to allow for triggering stack traces, heap dumps, etc.
> 3) Support for Docker's live restore, which means moving away from the use of {{docker wait}}. (YARN-5818)
> 4) Improve the resiliency of liveliness checks (kill -0) by adding retries.
> 5) Improve the resiliency of container removal by adding retries.
> 6) Only attempt to stop, kill, and remove containers if the current container state allows for it.
> 7) Better handling of short lived containers when the container is stopped before the PID can be retrieved. (YARN-6305)
