hadoop-yarn-issues mailing list archives

From "Naganarasimha G R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4459) container-executor might kill process wrongly
Date Mon, 04 Jan 2016 17:03:40 GMT

    [ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081372#comment-15081372 ]

Naganarasimha G R commented on YARN-4459:

Hi [~hex108],
Thanks for working on this JIRA. I am not from a C background, but I checked the API
of {{kill}} and have a few doubts.
IIUC, the existing code checks whether the container process has created any subprocesses and, if so, kills
all of them; otherwise, if it is a single process, I presume {{kill(-pid,0)}} will return
{{-1}}, and it then tries to kill only the container process ID. Can you confirm this by testing?
I just tested this with the Unix {{kill}} command. What I could understand is that
{{kill -0 -- -<pid which has children>}} succeeds and {{$?}} returns *0*, but when I run
{{kill -0 -- -<pid which has NO children>}}, the error {{bash: kill: (-10967) - No such process}}
is reported.
Correct me if my understanding is wrong.
cc [~vvasudev].
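A minimal C sketch of the probe being discussed, assuming a POSIX system. Per POSIX, a negative pid makes `kill()` address the process group whose ID is the absolute value, and signal 0 performs only the existence and permission checks, so `kill(-pid, 0)` fails with `ESRCH` when no process group with ID `pid` exists. In this sketch the outcome is decided by whether the child leads its own group (via `setpgid`), not by whether it has children. The helper names `process_group_exists`, `spawn_sleeper` and `reap` are hypothetical and are not container-executor functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns true if a process group whose ID is pgid exists.
 * kill(-pgid, 0) targets the group |pgid| and sends nothing; it fails
 * with ESRCH when no such group exists. */
static bool process_group_exists(pid_t pgid) {
    if (pgid <= 0) {            /* kill(0, 0) would probe our own group */
        errno = EINVAL;
        return false;
    }
    if (kill(-pgid, 0) == 0) {
        return true;
    }
    /* EPERM means the group exists but we may not signal it. */
    return errno == EPERM;
}

/* Hypothetical demo helper: forks a child that blocks in pause().
 * With make_leader set, the child is moved into a new process group whose
 * ID equals its pid; otherwise it stays in the parent's group. */
static pid_t spawn_sleeper(bool make_leader) {
    pid_t child = fork();
    if (child == 0) {
        if (make_leader) {
            setpgid(0, 0);
        }
        pause();
        _exit(0);
    }
    /* Also call setpgid from the parent so the group exists before it is
     * probed, whichever process runs first after fork(). */
    if (child > 0 && make_leader) {
        setpgid(child, child);
    }
    return child;
}

/* Hypothetical demo helper: kills and reaps a child from spawn_sleeper. */
static void reap(pid_t pid) {
    kill(pid, SIGKILL);
    pid_t waited = waitpid(pid, NULL, 0);
    assert(waited == pid);
    (void)waited;
}
```

Here `process_group_exists(spawn_sleeper(true))` is expected to be true, while `process_group_exists(spawn_sleeper(false))` is expected to be false with `errno == ESRCH`, which corresponds to the `0` versus "No such process" results of the shell test described in the comment.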

> container-executor might kill process wrongly
> ---------------------------------------------
>                 Key: YARN-4459
>                 URL: https://issues.apache.org/jira/browse/YARN-4459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4459.01.patch, YARN-4459.02.patch
> When calling 'signal_container_as_user' in container-executor, it first checks whether the process group exists; if not, it kills the process itself (if that process exists). This is not reasonable: if the process group does not exist, the corresponding container has already finished, so killing the process itself means killing the wrong process.
> We have seen this happen in our cluster many times. We used the same account to start the NM and to submit apps, and container-executor sometimes killed the NM (the wrongly killed process might just have been a newly started thread that was the NM's child process).
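The behaviour described in the issue can be sketched in C as follows, assuming a POSIX system. This is a hypothetical illustration, not the container-executor source or the attached patches: `signal_container_sketch`, its `fallback_to_pid` flag, the result codes and the demo helpers are invented names. With the flag set, it mimics the reported fallback (signal the bare pid when the group is gone); with it cleared, a missing group is treated as a finished container and nothing is signalled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch only. */
enum signal_result { SIGNALLED_GROUP, SIGNALLED_PID, CONTAINER_GONE, SIGNAL_FAILED };

/* Hypothetical sketch: deliver sig to the container whose process group
 * ID is pid.
 *  - fallback_to_pid = true mimics the behaviour reported in YARN-4459:
 *    if the group is gone, signal the bare pid, which may since have been
 *    reused by an unrelated process.
 *  - fallback_to_pid = false treats a missing group as "the container has
 *    already finished" and signals nothing. */
static enum signal_result signal_container_sketch(pid_t pid, int sig,
                                                  bool fallback_to_pid) {
    if (pid <= 0) {
        return SIGNAL_FAILED;
    }
    if (kill(-pid, 0) == 0) {
        return kill(-pid, sig) == 0 ? SIGNALLED_GROUP : SIGNAL_FAILED;
    }
    if (errno != ESRCH) {
        return SIGNAL_FAILED;          /* e.g. EPERM */
    }
    if (!fallback_to_pid) {
        return CONTAINER_GONE;         /* group gone: nothing to signal */
    }
    /* Reported behaviour: fall back to the pid itself. */
    if (kill(pid, sig) == 0) {
        return SIGNALLED_PID;
    }
    return errno == ESRCH ? CONTAINER_GONE : SIGNAL_FAILED;
}

/* Hypothetical demo helper: forks a process that stays in the caller's
 * process group, standing in for an unrelated process that reused a pid. */
static pid_t spawn_non_leader(void) {
    pid_t child = fork();
    assert(child >= 0);
    if (child == 0) {
        pause();
        _exit(0);
    }
    return child;
}

/* Hypothetical demo helper: waits for pid and returns its wait status. */
static int wait_status(pid_t pid) {
    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    assert(waited == pid);
    (void)waited;
    return status;
}
```

A process from `spawn_non_leader` plays the role of the unrelated process (such as the NM child mentioned in the report) that now owns the pid: the fallback path returns `SIGNALLED_PID` and kills it, while the no-fallback path returns `CONTAINER_GONE` and leaves it running.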

This message was sent by Atlassian JIRA
