hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4459) container-executor might kill process wrongly
Date Tue, 05 Jan 2016 01:14:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082158#comment-15082158
] 

Jun Gong commented on YARN-4459:
--------------------------------

Thanks [~Naganarasimha] for the review, very appreciate it!

{quote}
IIUC existing code checks whether container process has created any sub process then kill
all the process, else if its a single process then i presume kill(-pid,0) will return -1 then
it tries to kill only the container process id only. Can you confirm this by testing?
{quote}
I do not have a good method to add a test case for it now. I will try to explain it by following
test cases:
1. If container's parent process does not exist and its child process does exist, existing
code works well.
{code}
root@bd74d89a2294:/$ cat test.sh 
sleep 6666 &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep 6666' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh 
PID for 'sleep 6666' : 26877, its pgid : 26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26877     1  0 08:48 pts/0    00:00:00 sleep 6666
root     26880 26797  0 08:48 pts/0    00:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 -26876
root@bd74d89a2294:/$ kill -15 -26876
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26882 26797  0 08:48 pts/0    00:00:00 grep sleep
{code}

2. If container's parent process does not exist and its child process does not exist either,
existing code will kill process wrongly.
{code}
root@bd74d89a2294:/$ cat test.sh 
sleep 2 &
PID_FOR_SLEEP=$!
PGID_FOR_SLEEP=$(ps -p $PID_FOR_SLEEP -o pgid=)
echo "PID for 'sleep 2' : $PID_FOR_SLEEP, its pgid : $PGID_FOR_SLEEP"
root@bd74d89a2294:/$ ./test.sh 
PID for 'sleep 2' : 26890, its pgid : 26889
root@bd74d89a2294:/$ ps -ef | grep sleep
root     26893 26797  0 08:56 pts/0    00:00:00 grep sleep
root@bd74d89a2294:/$ kill -0 26889
-bash: kill: (26889) - No such process
{code}
Then we check existing code in container-executor.c for the above case:
{code}
  if (kill(-pid,0) < 0) {
    if (kill(pid, 0) < 0) {
      if (errno == ESRCH) {
        return INVALID_CONTAINER_PID;
      }
      fprintf(LOGFILE, "Error signalling container %d with %d - %s\n",
	      pid, sig, strerror(errno));
      return -1;
    } else {
      has_group = 0;
    }
  }

  if (kill((has_group ? -1 : 1) * pid, sig) < 0) {
{code}
*kill(-pid,0)* will return -1. If *pid* is reused for a new process(suppose that the container
has survived for a long time, pid recycle occurs, then this pid might be reused again), *kill(pid,
0)* will return 0, then *has_group* will be set to 0, and the code *kill((has_group ? -1 :
1) * pid, sig)* will try to kill *pid* with *sig*. This is the problem.

> container-executor might kill process wrongly
> ---------------------------------------------
>
>                 Key: YARN-4459
>                 URL: https://issues.apache.org/jira/browse/YARN-4459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When calling 'signal_container_as_user' in container-executor, it first checks whether
process group exists, if not, it will kill the process itself(if it the process exists). 
It is not reasonable because that the process group does not exist means corresponding container
has finished, if we kill the process itself, we just kill wrong process.
> We found it happened in our cluster many times. We used same account for starting NM
and submitted app, and container-executor sometimes killed NM(the wrongly killed process might
just be a newly started thread and was NM's child process).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message