hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1922) Process group remains alive after container process is killed externally
Date Sun, 02 Nov 2014 20:50:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194012#comment-14194012
] 

Vinod Kumar Vavilapalli commented on YARN-1922:
-----------------------------------------------

Here's my notes trying to understand the scenario
 - If the cleanup request came in before the launch happens, we simply skip both launch and
cleanup. This is taken care of by ContainersLauncher together with ContainerLaunch.shouldLaunchContainer.
 - The interesting case is when both launch and cleanup came in a the same time. This can
be split into more cases, depending on _when_ the cleanup-request came in.
    ## Launch failed and pid-file is not written, and _completed_ is true.
    ## Launch happened, pid-file is written but entire process-tree crashed immediately after,
so _completed_ is true.
    ## Launch happened, pid-file is written, root process crashed immediately after, but the
remaining process-tree is still running, so _completed_ is true.
    ## Launch happened, pid-file is written and process-tree is running, so and _completed_
is false.

>From the code, the current behavior is fine for (1), (2) and (4), but (3) is leaking processes
before the patch.

I don't think what the patch is doing is sufficient - it does address (3) but only if pid-file
was written with 100ms before the root-process crashes.

For comprehensively tackling (3), what we really need is to just wait till either pid-file
is written or for the timer to expire maxKillWaitTime. Waiting for once (100ms) is not enough
- I don't think your test is catching this, but it should be possible write a test for this
that will prove that the patch is not sufficient.

Agreed?

> Process group remains alive after container process is killed externally
> ------------------------------------------------------------------------
>
>                 Key: YARN-1922
>                 URL: https://issues.apache.org/jira/browse/YARN-1922
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>         Environment: CentOS 6.4
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>         Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch, YARN-1922.4.patch,
YARN-1922.5.patch
>
>
> If the main container process is killed externally, ContainerLaunch does not kill the
rest of the process group.  Before sending the event that results in the ContainerLaunch.containerCleanup
method being called, ContainerLaunch sets the "completed" flag to true.  Then when cleaning
up, it doesn't try to read the pid file if the completed flag is true.  If it read the pid
file, it would proceed to send the container a kill signal.  In the case of the DefaultContainerExecutor,
this would kill the process group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message