hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4459) container-executor might kill process wrongly
Date Mon, 11 Jan 2016 13:59:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091926#comment-15091926 ]

Jun Gong commented on YARN-4459:

Thanks [~vvasudev] for the review and suggestions.

> assume that the process with the recycled pid does a setsid call - then the process group
> check will succeed and we might still end up killing the wrong process, no?
Yes, it might still kill the wrong process; I have not found a perfect way to avoid this. At
least it will no longer kill the NM.

> The assumptions of the patch are 
> 1) The new process will not belong to the same user and 
> 2) The new process has not called setsid
Yes, although the *and* is actually *or*. The patch kills the wrong process only when the new
process belongs to the same user and has also called setsid, so it happens less often.
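
To make the and/or point concrete, here is a minimal C sketch of the guard's logic only. The
names `pid_facts` and `guard_allows_kill` are hypothetical and are not taken from the patch;
how the two facts are gathered for a pid is deliberately left out.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical summary of what is known about the process that now holds the pid. */
struct pid_facts {
    uid_t owner;            /* user that owns the process */
    bool  leads_own_group;  /* true once it has called setsid(), since then pgid == pid */
};

/*
 * The kill goes ahead only if BOTH conditions hold. A recycled pid is therefore
 * spared if EITHER assumption is true: it belongs to another user, or it has
 * not called setsid().
 */
static bool guard_allows_kill(const struct pid_facts *facts, uid_t container_user)
{
    return facts->owner == container_user && facts->leads_own_group;
}
```

Under these assumptions, the only recycled pid that still gets killed is one owned by the
container's user that has also called setsid.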

> I suspect we might need to add a timing check similar to the one proposed in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560578&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560578
Adding this check will reduce the chance of killing the wrong process; I will add it later. It
cannot rule it out entirely either, as [~zhiguohong] said in https://issues.apache.org/jira/browse/YARN-3678?focusedCommentId=14560748&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14560748.

> container-executor might kill process wrongly
> ---------------------------------------------
>                 Key: YARN-4459
>                 URL: https://issues.apache.org/jira/browse/YARN-4459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4459.01.patch, YARN-4459.02.patch
> When 'signal_container_as_user' is called in container-executor, it first checks whether
> the process group exists; if not, it kills the process itself (if the process exists).
> This is not reasonable: if the process group does not exist, the corresponding container
> has already finished, so killing the process itself just kills the wrong process.
> We found this happened in our cluster many times. We used the same account for starting the NM
> and submitting apps, and container-executor sometimes killed the NM (the wrongly killed process
> might just be a newly started thread and was the NM's child process).
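
The behaviour the issue argues for can be sketched in C as follows: probe the process group
with signal 0, and if the group is gone, report the container as finished instead of falling
back to killing a single pid. The function names `process_group_exists` and
`signal_container_group` are hypothetical; this is not the container-executor source, only an
illustration using the POSIX `kill` call.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the process group exists, 0 if it does not, -1 on a bad argument or error. */
static int process_group_exists(pid_t pgid)
{
    /* Refuse 0, 1 and negatives: kill(-1, ...) would address every process we may signal. */
    if (pgid <= 1) {
        return -1;
    }
    if (kill(-pgid, 0) == 0) {
        return 1;               /* signal 0 only performs the existence/permission check */
    }
    if (errno == EPERM) {
        return 1;               /* the group exists, but we are not allowed to signal it */
    }
    if (errno == ESRCH) {
        return 0;               /* no such process group */
    }
    return -1;
}

/*
 * Signals the whole group, never an individual pid.
 * Returns 1 if the signal was sent, 0 if the group no longer exists
 * (the container is treated as finished), -1 on error.
 */
static int signal_container_group(pid_t pgid, int sig)
{
    int exists = process_group_exists(pgid);
    if (exists <= 0) {
        return exists;
    }
    return kill(-pgid, sig) == 0 ? 1 : -1;
}
```

A return value of 0 lets the caller skip the kill entirely. Pid reuse can still make the group
check succeed for an unrelated process, as discussed in the comment above, so this narrows the
problem without eliminating it.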

This message was sent by Atlassian JIRA
