hadoop-yarn-issues mailing list archives

From "Hong Zhiguo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3678) DelayedProcessKiller may kill other process other than container
Date Wed, 27 May 2015 07:46:18 GMT

    [ https://issues.apache.org/jira/browse/YARN-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560578#comment-14560578 ]

Hong Zhiguo commented on YARN-3678:
-----------------------------------

We hit the same issue on our production cluster last year. The same user account is used for the NM
and for some app submitters.
I hooked the kernel's __send_signal function with a kprobe (https://github.com/honkiko/signal-monitor)
and confirmed the following sequence:
 - container-executor sends SIGTERM to a container (say, pid = X)
 - The container exits quickly (within 250 ms)
 - pid X is recycled and assigned to a newly spawned thread of the NM
 - 250 ms after the SIGTERM, container-executor sends SIGKILL to pid X
 - The NM is killed

I added a check of the process's living time (the time since it started) before container-executor
sends SIGKILL. If the process has been alive for less than 250 ms, it cannot be the process that we
sent SIGTERM to, so container-executor just skips it.
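
The living-time check described above can be sketched roughly as follows. This is only an illustration in Python, not the actual patch (container-executor itself is written in C). It assumes a Linux /proc filesystem, and the function names parse_start_ticks, process_age_seconds and should_send_sigkill are hypothetical. The age is derived from field 22 (starttime) of /proc/<pid>/stat and from /proc/uptime.

```python
import os


def parse_start_ticks(stat_line: str) -> int:
    """Return field 22 (starttime, in clock ticks since boot) of a /proc/<pid>/stat line."""
    # Field 2 (the command name) is parenthesised and may itself contain
    # spaces or parentheses, so only split the text after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 22 is rest[19].
    return int(rest[19])


def process_age_seconds(pid: int) -> float:
    """Return how long the process with this pid has been alive (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        start_ticks = parse_start_ticks(f.read())
    with open("/proc/uptime") as f:
        uptime_seconds = float(f.read().split()[0])
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    return uptime_seconds - start_ticks / ticks_per_second


def should_send_sigkill(age_seconds: float, delay_seconds: float = 0.25) -> bool:
    """Decide whether the pid can still be the process that received SIGTERM.

    A process younger than the SIGTERM-to-SIGKILL delay must have started
    after SIGTERM was sent, so it is a recycled pid and must be skipped.
    """
    return age_seconds >= delay_seconds
```

A caller would evaluate should_send_sigkill(process_age_seconds(pid)) right before sending SIGKILL. A small window between the check and the signal remains, so a check like this narrows the race rather than closing it.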

With this fix, the "accident" rate dropped from several times per day to nearly zero.
If you think such a fix is acceptable, I'll post it here.

> DelayedProcessKiller may kill other process other than container
> ----------------------------------------------------------------
>
>                 Key: YARN-3678
>                 URL: https://issues.apache.org/jira/browse/YARN-3678
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: gu-chi
>            Priority: Critical
>
> Suppose a container has finished and is being cleaned up. The PID file still exists and
> will trigger signalContainer once, which kills the process whose pid is in the PID file. But
> because the container has already finished, this PID may have been taken by another process,
> which may cause a serious issue.
> As far as I know, my NM was killed unexpectedly, and what I described can be the cause, even
> if it rarely occurs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
