hadoop-yarn-issues mailing list archives

From "Shane Kumpf (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7973) Support ContainerRelaunch for Docker containers
Date Fri, 16 Mar 2018 12:34:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16401818#comment-16401818 ]

Shane Kumpf commented on YARN-7973:

[~billie.rinaldi] - I looked into the issue you reported. The behavior you see occurs with
or without this patch.

What you see repeated over and over above is the Diagnostics field returned by the
ContainerStatus calls. Pulling out only the Diagnostics field, you get:
Diagnostics: [2018-03-08 22:02:53.397]Exception from container-launch.
Container id: container_1520546307703_0001_01_000002
Exit code: -1
Exception message: <unknown>
Shell output: <unknown>

[2018-03-08 22:02:53.500]Diagnostic message from attempt 0 : [2018-03-08 22:02:53.500]
[2018-03-08 22:02:53.501]Container exited with a non-zero exit code -1.

You will see this repeated once per second until the next relaunch attempt (30 seconds by
default with native services). Once the relaunch occurs, you will see an exception stating that
the relaunch failed, because the container isn't in a startable state. I could be convinced to call
launchContainer in this case to reproduce the original error if you feel that is most appropriate,
but I think there are alternative improvements to make here:
 * The logs are hard to follow with the diagnostics embedded in the log entry when returning
the ContainerStatus. It looks like exceptions are repeated over and over, as you saw. We should
consider moving this to debug logging.
 * Populate diagnostics with a better error in this case. The {{ContainerExecutionException}}
thrown as part of this ACL check does not become part of the Diagnostics field.
 * Native Services currently uses {{ContainerRetryPolicy.RETRY_ON_ALL_ERRORS}}, which may be
too broad. An exit code of -1 should likely be a hard failure.

I'll open issues for these if that sounds good.
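
To illustrate the last point in the list above, here is a rough, standalone sketch of a retry
decision that treats exit code -1 as a hard failure even under a retry-on-all-errors policy.
It is not NodeManager code: the local {{Policy}} enum only borrows the value names of
{{ContainerRetryPolicy}}, and {{RetryDecisionSketch}}, {{shouldRetry}} and
{{HARD_FAIL_EXIT_CODE}} are hypothetical names made up for this example.

```java
import java.util.Set;

final class RetryDecisionSketch {
    // Local stand-in that borrows the value names of YARN's ContainerRetryPolicy.
    enum Policy { NEVER_RETRY, RETRY_ON_ALL_ERRORS, RETRY_ON_SPECIFIC_ERROR_CODES }

    // Assumption for this sketch: -1 means the launch itself failed, so retrying is pointless.
    static final int HARD_FAIL_EXIT_CODE = -1;

    static boolean shouldRetry(Policy policy, Set<Integer> retryCodes, int exitCode) {
        if (exitCode == 0) {
            return false; // clean exit, nothing to retry
        }
        if (exitCode == HARD_FAIL_EXIT_CODE) {
            return false; // hard failure regardless of the configured policy
        }
        switch (policy) {
            case RETRY_ON_ALL_ERRORS:
                return true;
            case RETRY_ON_SPECIFIC_ERROR_CODES:
                return retryCodes.contains(exitCode);
            default:
                return false; // NEVER_RETRY
        }
    }

    public static void main(String[] args) {
        Set<Integer> none = Set.of();
        System.out.println(shouldRetry(Policy.RETRY_ON_ALL_ERRORS, none, -1));
        System.out.println(shouldRetry(Policy.RETRY_ON_ALL_ERRORS, none, 137));
    }
}
```

With a check like this, the failed launch above would surface once instead of being retried
until the retry limit is reached.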

> Support ContainerRelaunch for Docker containers
> -----------------------------------------------
>                 Key: YARN-7973
>                 URL: https://issues.apache.org/jira/browse/YARN-7973
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Shane Kumpf
>            Assignee: Shane Kumpf
>            Priority: Major
>         Attachments: YARN-7973.001.patch, YARN-7973.002.patch
> Prior to YARN-5366, {{container-executor}} would remove the Docker container when it
exited. The removal is now handled by the {{DockerLinuxContainerRuntime}}. {{ContainerRelaunch}} is
intended to reuse the workdir from the previous attempt, and does not call {{cleanupContainer}} prior
to {{launchContainer}}. The container ID is reused as well. As a result, the previous Docker
container still exists, resulting in an error from Docker indicating that a container by that
name already exists.
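
The name conflict described above can be modeled with a small in-memory sketch. Everything
here is hypothetical: {{FakeDocker}} stands in for the Docker daemon and only tracks container
names, and removing the stale container before relaunch is just one possible remedy shown for
illustration, not necessarily what the attached patches do.

```java
import java.util.HashSet;
import java.util.Set;

final class RelaunchNameConflictSketch {
    // Hypothetical stand-in for a Docker daemon: an exited container keeps its name
    // until it is explicitly removed.
    static final class FakeDocker {
        private final Set<String> names = new HashSet<>();

        void run(String name) {
            if (!names.add(name)) {
                throw new IllegalStateException(
                    "container name " + name + " is already in use");
            }
        }

        void remove(String name) {
            names.remove(name);
        }
    }

    // First launch: the container ID is used as the Docker container name.
    static void launch(FakeDocker docker, String containerId) {
        docker.run(containerId);
    }

    // Relaunch that reuses the ID without cleanup: fails with a name conflict.
    static void relaunchWithoutCleanup(FakeDocker docker, String containerId) {
        docker.run(containerId);
    }

    // One possible remedy: remove the stale container, then run again.
    static void relaunchWithCleanup(FakeDocker docker, String containerId) {
        docker.remove(containerId);
        docker.run(containerId);
    }

    public static void main(String[] args) {
        FakeDocker docker = new FakeDocker();
        String id = "container_1520546307703_0001_01_000002";
        launch(docker, id);
        try {
            relaunchWithoutCleanup(docker, id);
        } catch (IllegalStateException e) {
            System.out.println("relaunch failed: " + e.getMessage());
        }
        relaunchWithCleanup(docker, id);
        System.out.println("relaunch after cleanup succeeded");
    }
}
```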

