hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Date Thu, 09 Jun 2016 18:44:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323073#comment-15323073 ]

Junping Du commented on YARN-5214:
----------------------------------

Thanks [~nroberts] for sharing the solution on this!
I agree that to fix the root cause of this particular issue, we may need to configure the deadline
IO scheduler in Linux. Otherwise, excessively long IO waits will almost certainly cause other serious
issues; for example, we also noticed that ResourceLocalizationService gets blocked as well.
On the other hand, we need to decide whether a hung NM heartbeat or localizer under busy IO
with the wrong IO scheduler setting is something we really want here. At the very least, we should
replace the synchronized method lock with a lock we can try to acquire, and print a useful debug
log message if the wait lasts too long. Maybe we can do more, following the same principle as
HDFS-9239, and release unnecessary locks on the NM-RM heartbeat path as much as possible? Thoughts?
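
To make the try-lock idea concrete, here is a minimal sketch. It is not the actual NM code: the class TimedLockDirs, its method signatures, the Optional return type, and the 50 ms timeout are all hypothetical stand-ins chosen for illustration. The reader waits a bounded time for the lock and logs a message instead of blocking indefinitely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for DirectoryCollection; not the real NodeManager class.
class TimedLockDirs {
    private final ReentrantLock lock = new ReentrantLock();
    private final List<String> failedDirs = new ArrayList<>();

    // Simulates checkDirs(): holds the lock for the whole (possibly slow) disk check.
    void checkDirs(Runnable diskCheck) {
        lock.lock();
        try {
            diskCheck.run();
        } finally {
            lock.unlock();
        }
    }

    // Waits at most timeoutMs for the lock. On timeout, logs and returns
    // Optional.empty() so the heartbeat thread can carry on instead of hanging.
    Optional<List<String>> getFailedDirs(long timeoutMs) {
        boolean locked = false;
        try {
            locked = lock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (!locked) {
                System.err.println("getFailedDirs: lock not acquired after "
                    + timeoutMs + " ms; holder may be stuck in disk IO");
                return Optional.empty();
            }
            return Optional.of(new ArrayList<>(failedDirs));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return Optional.empty();
        } finally {
            if (locked) {
                lock.unlock();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedLockDirs dirs = new TimedLockDirs();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        // Stand-in for a monitor thread blocked in disk IO while holding the lock.
        Thread monitor = new Thread(() -> dirs.checkDirs(() -> {
            held.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        monitor.start();
        held.await();
        System.out.println("while check runs: " + dirs.getFailedDirs(50).isPresent());
        release.countDown();
        monitor.join();
        System.out.println("after check ends: " + dirs.getFailedDirs(50).isPresent());
    }
}
```

With something like this, the caller has to decide what to report when the result is empty, for example by reusing the previous health report. That policy is a design choice this sketch leaves open.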

> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped and, after a while,
> the node was marked LOST by the RM. From the log, the NM daemon is still running, but jstack shows
> that the NM's NodeStatusUpdater thread is blocked:
> 1. The Node Status Updater thread is blocked waiting to lock 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
>         - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createDirectory(Native Method)
>         at java.io.File.mkdir(File.java:1316)
>         at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
>         - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> This disk operation can take longer than expected, especially under high IO throughput,
> and we should use fine-grained locking for the related operations here.
> The same issue was raised and fixed on the HDFS side in HDFS-7489, and we should probably
> make a similar fix here.
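
One way to read the fine-grained locking suggested in the description above is sketched below. This is an illustration under assumptions, not the actual patch for this issue or for HDFS-7489: the class SnapshotDirs and its methods are hypothetical. The slow per-directory check runs with no lock held, and the lock is taken only briefly to copy the inputs and to publish the results, so a reader such as the heartbeat thread never waits on disk IO.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for DirectoryCollection; names are illustrative only.
class SnapshotDirs {
    private final Object lock = new Object();
    private List<String> goodDirs;
    private List<String> failedDirs = new ArrayList<>();

    SnapshotDirs(List<String> dirs) {
        this.goodDirs = new ArrayList<>(dirs);
    }

    // Slow path: the per-directory check (e.g. a mkdir probe) runs WITHOUT the lock.
    void checkDirs(Predicate<String> isHealthy) {
        List<String> all;
        synchronized (lock) {          // short critical section: copy the inputs
            all = new ArrayList<>(goodDirs);
            all.addAll(failedDirs);
        }
        List<String> good = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String dir : all) {       // potentially slow disk IO, no lock held
            (isHealthy.test(dir) ? good : failed).add(dir);
        }
        synchronized (lock) {          // short critical section: publish the results
            goodDirs = good;
            failedDirs = failed;
        }
    }

    // Fast path for the heartbeat: only ever waits for the short sections above.
    List<String> getFailedDirs() {
        synchronized (lock) {
            return new ArrayList<>(failedDirs);
        }
    }

    public static void main(String[] args) {
        SnapshotDirs dirs = new SnapshotDirs(Arrays.asList("/good", "/bad"));
        dirs.checkDirs(dir -> !dir.equals("/bad")); // pretend "/bad" fails its probe
        System.out.println(dirs.getFailedDirs());
    }
}
```

The trade-off is that readers may see results from the previous check while a new one is in progress, and two overlapping checkDirs calls would publish in last-writer-wins order. With a single timer thread driving the checks, as in the stack trace above, the second point should not arise.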



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

