hadoop-yarn-issues mailing list archives

From "Hu Ziqian (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-8382) cgroup file leak in NM
Date Sun, 03 Jun 2018 03:54:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-8382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hu Ziqian updated YARN-8382:
----------------------------
    Attachment: YARN-8382.002.patch
                YARN-8382-branch-2.8.3.002.patch

> cgroup file leak in NM
> ----------------------
>
>                 Key: YARN-8382
>                 URL: https://issues.apache.org/jira/browse/YARN-8382
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>         Environment: We wrote a container with a shutdown hook that runs code like
>                      "while(true) sleep(100)". When
>                      *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* is less than
>                      *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup file leak happens; when
>                      it is greater than *yarn.nodemanager.sleep-delay-before-sigkill.ms*, the cgroup
>                      file is deleted successfully.
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Major
>         Attachments: YARN-8382-branch-2.8.3.001.patch, YARN-8382-branch-2.8.3.002.patch,
YARN-8382.001.patch, YARN-8382.002.patch
>
>
> As Jiandan said in YARN-6525, the NM may time out while deleting a container's cgroup files,
> logging messages like the following:
> org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler: Unable to
delete cgroup at: /cgroup/cpu/hadoop-yarn/container_xxx, tried to delete for 1000ms
>  
> We found that one such situation arises when *yarn.nodemanager.sleep-delay-before-sigkill.ms*
> is set larger than *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms*: the cgroup
> files leak.
>  
> One container's process tree looks like the following:
> bash(16097)───java(16099)─┬─{java}(16100)
>                           ├─{java}(16101)
>                           ├─{java}(16102)
>  
> When the NM kills a container, it sends kill -15 -pid to signal the container's process group.
> The bash process exits when it receives SIGTERM, but the java process may still do some work (a shutdown
> hook, etc.) and not exit until it receives SIGKILL. As soon as the bash process exits,
> CgroupsLCEResourcesHandler starts trying to delete the cgroup files. So when
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* expires, the java process may
> still be running, cgroup/tasks is still not empty, and the cgroup files leak.
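
The timing race described above can be modeled with a small simulation. This is only a sketch under stated assumptions, not the NodeManager's actual code: the class name, the method `deleteWithTimeout`, the 20 ms retry interval, and the simulated clock are all hypothetical, and the java process is assumed to leave the cgroup only once SIGKILL arrives.

```java
import java.util.function.LongPredicate;

// Hypothetical model of the race; not taken from CgroupsLCEResourcesHandler.
class CgroupDeleteSketch {

    /**
     * Retries a simulated cgroup delete on a simulated clock.
     *
     * @param deleteTimeoutMs stands in for cgroups.delete-timeout-ms
     * @param retryDelayMs    assumed pause between delete attempts
     * @param tasksEmptyAt    true if the cgroup has no tasks at time t (ms after SIGTERM)
     * @return true if the delete succeeds before the timeout, false if the cgroup leaks
     */
    static boolean deleteWithTimeout(long deleteTimeoutMs, long retryDelayMs,
                                     LongPredicate tasksEmptyAt) {
        if (retryDelayMs <= 0) {
            throw new IllegalArgumentException("retryDelayMs must be positive");
        }
        for (long now = 0; now < deleteTimeoutMs; now += retryDelayMs) {
            if (tasksEmptyAt.test(now)) {
                return true; // tasks file is empty, so removing the cgroup can succeed
            }
        }
        return false; // timeout expired while the java process was still in the cgroup
    }

    public static void main(String[] args) {
        // Assumption: the java process only exits when SIGKILL is delivered.
        long shortSigkillDelayMs = 250;
        long longSigkillDelayMs = 2000;

        // SIGKILL arrives before the delete timeout: the delete succeeds.
        System.out.println(deleteWithTimeout(1000, 20, t -> t >= shortSigkillDelayMs));
        // SIGKILL arrives after the delete timeout: the cgroup leaks.
        System.out.println(deleteWithTimeout(1000, 20, t -> t >= longSigkillDelayMs));
    }
}
```

The 1000 ms timeout matches the log line quoted earlier; the two SIGKILL delays are illustrative values chosen to fall on either side of it.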
>  
> To solve this problem, we add a condition that
> *yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms* must be larger than
> *yarn.nodemanager.sleep-delay-before-sigkill.ms*.
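
One way to picture the proposed condition is a check over the two settings. This is a hedged sketch only: how the attached patches actually enforce the condition is not shown in this message, a plain `Map` stands in for Hadoop's configuration object, and the class and method names are hypothetical. Only the two property keys come from the report.

```java
import java.util.Map;

// Hypothetical check; not the code from the YARN-8382 patches.
class CgroupTimeoutCheck {
    static final String DELETE_TIMEOUT_KEY =
        "yarn.nodemanager.linux-container-executor.cgroups.delete-timeout-ms";
    static final String SIGKILL_DELAY_KEY =
        "yarn.nodemanager.sleep-delay-before-sigkill.ms";

    /** Returns true when the cgroup delete timeout outlasts the SIGKILL delay. */
    static boolean isSafe(Map<String, String> conf) {
        long deleteTimeoutMs = Long.parseLong(conf.get(DELETE_TIMEOUT_KEY));
        long sigkillDelayMs = Long.parseLong(conf.get(SIGKILL_DELAY_KEY));
        return deleteTimeoutMs > sigkillDelayMs;
    }

    public static void main(String[] args) {
        // Illustrative values: the first pair satisfies the condition, the second does not.
        Map<String, String> ok = Map.of(DELETE_TIMEOUT_KEY, "1000", SIGKILL_DELAY_KEY, "250");
        Map<String, String> leaky = Map.of(DELETE_TIMEOUT_KEY, "1000", SIGKILL_DELAY_KEY, "2000");
        System.out.println(isSafe(ok));
        System.out.println(isSafe(leaky));
    }
}
```

The comparison is strict, following the report's wording that the delete timeout must be larger than the SIGKILL delay.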
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

