hadoop-yarn-issues mailing list archives

From "gu-chi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1922) Process group remains alive after container process is killed externally
Date Mon, 18 May 2015 13:32:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547997#comment-14547997 ]

gu-chi commented on YARN-1922:
------------------------------

Hi, I see your comment here about checking in YARN-1922.5.patch, but why was YARN-1922.6.patch merged?
What was the concern?
I think this solution may have a defect.
Suppose a container finishes and then goes through cleanup. The PID file still exists, so cleanup
triggers one signalContainer call, which kills the process whose pid is in the PID file. But because
the container has already finished, that PID may by then be occupied by another process, which could
cause a serious issue.
As far as I know, my NM was killed unexpectedly, and what I described could be the cause, even if it
rarely occurs.
Below is the error scenario: task cleanup had not finished when the NM was killed, and the NM then started again.

2015-05-14 21:49:03,063 | INFO  | DeletionService #1 | Deleting absolute path : /export/data1/yarn/nm/localdir/usercache/omm/appcache/application_1430456703237_8047/container_1430456703237_8047_01_12582917
| org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.deleteAsUser(LinuxContainerExecutor.java:400)
2015-05-14 21:49:03,063 | INFO  | AsyncDispatcher event handler | Container container_1430456703237_8047_01_12582917
transitioned from EXITED_WITH_SUCCESS to DONE | org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:918)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Removing container_1430456703237_8047_01_12582917
from application application_1430456703237_8047 | org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl$ContainerDoneTransition.transition(ApplicationImpl.java:340)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Considering container container_1430456703237_8047_01_12582917
for log-aggregation | org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.startContainerLogAggregation(AppLogAggregatorImpl.java:342)
2015-05-14 21:49:03,064 | INFO  | AsyncDispatcher event handler | Got event CONTAINER_STOP
for appId application_1430456703237_8047 | org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.handle(AuxServices.java:196)
2015-05-14 21:49:03,152 | INFO  | Node Status Updater | Removed completed containers from
NM context: [container_1430456703237_8047_01_12582917] | org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.removeCompletedContainersFromContext(NodeStatusUpdaterImpl.java:417)
2015-05-14 21:49:03,293 | INFO  | Task killer for 26924 | Using linux-container-executor.users
as omm | org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.signalContainer(LinuxContainerExecutor.java:349)
2015-05-14 21:49:20,667 | INFO  | main | STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG:   host = SR6S11/192.168.10.21
STARTUP_MSG:   args = []
STARTUP_MSG:   version = V100R001C00
STARTUP_MSG:   classpath = 
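
The PID-reuse concern raised in the comment above can be illustrated with a small sketch. This is not the NodeManager's code: it is a hypothetical guard, written in Python for brevity, that assumes the launcher recorded the process start time next to the pid when the container was started. The names read_start_time and safe_signal are invented for this illustration. On Linux the start time is field 22 of /proc/<pid>/stat.

```python
import os
import signal
from typing import Callable, Optional


def read_start_time(pid: int) -> Optional[str]:
    """Return the kernel start time of `pid` from /proc (Linux only), or None."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            stat = f.read()
    except OSError:
        return None  # no such process, or /proc is unavailable
    # Field 2 (the command name) is parenthesised and may contain spaces,
    # so split only what follows the last ')'. The remainder starts at
    # field 3, which puts starttime (field 22) at index 19.
    fields = stat[stat.rfind(")") + 2:].split()
    return fields[19] if len(fields) > 19 else None


def safe_signal(
    pid: int,
    recorded_start: str,
    sig: int = signal.SIGTERM,
    start_time_of: Callable[[int], Optional[str]] = read_start_time,
    kill: Callable[[int, int], None] = os.kill,
) -> bool:
    """Signal `pid` only if it still looks like the process we launched.

    Returns True if the signal was sent, False if the pid is gone or has
    been reused by a process with a different start time.
    """
    current = start_time_of(pid)
    if current is None or current != recorded_start:
        return False  # stale pid file: do not signal an unrelated process
    kill(pid, sig)
    return True
```

A check like this only narrows the window. The pid can still be recycled between the comparison and the kill call, so it reduces the risk described above without removing it.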

> Process group remains alive after container process is killed externally
> ------------------------------------------------------------------------
>
>                 Key: YARN-1922
>                 URL: https://issues.apache.org/jira/browse/YARN-1922
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.0
>         Environment: CentOS 6.4
>            Reporter: Billie Rinaldi
>            Assignee: Billie Rinaldi
>             Fix For: 2.6.0
>
>         Attachments: YARN-1922.1.patch, YARN-1922.2.patch, YARN-1922.3.patch, YARN-1922.4.patch,
> YARN-1922.5.patch, YARN-1922.6.patch
>
>
> If the main container process is killed externally, ContainerLaunch does not kill the
> rest of the process group.  Before sending the event that results in the ContainerLaunch.containerCleanup
> method being called, ContainerLaunch sets the "completed" flag to true.  Then when cleaning
> up, it doesn't try to read the pid file if the completed flag is true.  If it read the pid
> file, it would proceed to send the container a kill signal.  In the case of the DefaultContainerExecutor,
> this would kill the process group.
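
The cleanup behaviour described in the issue can be modelled with a toy function. This is a simplified sketch in Python, not the actual ContainerLaunch code: container_cleanup, read_pid_file and signal_container are hypothetical names standing in for the real methods, and the model covers only the decision described above.

```python
from typing import Callable, Optional


def container_cleanup(
    completed: bool,
    read_pid_file: Callable[[], Optional[int]],
    signal_container: Callable[[int], None],
) -> bool:
    """Toy model of the cleanup decision; returns True if a signal was sent."""
    if completed:
        # Behaviour reported in the issue: the pid file is never read,
        # so any surviving members of the process group are left alone.
        return False
    pid = read_pid_file()
    if pid is None:
        return False  # nothing recorded, nothing to signal
    signal_container(pid)
    return True
```

In this model, an externally killed main process sets completed to true first, so the second branch is never reached and no kill signal is sent. Reading the pid file regardless of the flag changes that, and it is also what opens the stale-pid question raised in the comment.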



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
