hadoop-yarn-issues mailing list archives

From "Nathan Roberts (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Date Wed, 08 Jun 2016 20:04:21 GMT

    [ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321342#comment-15321342 ]

Nathan Roberts commented on YARN-5214:

I'm not suggesting this change shouldn't be made, but keep in mind that if the NM is having
trouble performing this type of action within the timeout (10 minutes or so), then the node
is not very healthy and probably shouldn't be given anything more to run until the situation
improves. It's going to have trouble doing all sorts of other things as well, so having it
look unhealthy in some fashion isn't all bad. If we somehow keep heartbeats completely free
of I/O, then the RM will keep assigning containers that will likely run into exactly the same

We used to see similar issues, which we resolved by switching to the deadline I/O scheduler
(assuming Linux). See https://issues.apache.org/jira/browse/HDFS-9239?focusedCommentId=15218302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15218302

> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped and, after a
> while, the node was marked LOST by the RM. From the log, the NM daemon is still running, but
> jstack shows that the NM's NodeStatusUpdater thread is blocked:
> 1. The Node Status Updater thread is blocked waiting for the lock 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
>         - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createDirectory(Native Method)
>         at java.io.File.mkdir(File.java:1316)
>         at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
>         - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> This disk operation can take longer than expected, especially under high I/O throughput,
> and we should use a finer-grained lock for the related operations here.
> The same issue was raised and fixed on HDFS in HDFS-7489, and we should probably apply a
> similar fix here.
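
The finer-grained locking proposed in the description can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual patch for this issue: the class DirCollectionSketch and the helper isUsable are hypothetical stand-ins for DirectoryCollection and DiskChecker.checkDir. The idea is to hold a lock only while copying or swapping the directory lists and to run the slow mkdir-style probes with no lock held, so that a reader such as getFailedDirs never waits on disk I/O.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical sketch; NOT the real DirectoryCollection class. */
class DirCollectionSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> goodDirs;
    private List<String> failedDirs = new ArrayList<>();

    DirCollectionSketch(List<String> dirs) {
        this.goodDirs = new ArrayList<>(dirs);
    }

    /** Cheap read path (e.g. for the heartbeat): copies a list, no disk I/O. */
    List<String> getFailedDirs() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(failedDirs);
        } finally {
            lock.readLock().unlock();
        }
    }

    List<String> getGoodDirs() {
        lock.readLock().lock();
        try {
            return new ArrayList<>(goodDirs);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Slow path, assumed to be called from a single monitor thread. */
    void checkDirs() {
        // 1. Snapshot every known directory under the read lock.
        List<String> all;
        lock.readLock().lock();
        try {
            all = new ArrayList<>(goodDirs);
            all.addAll(failedDirs);
        } finally {
            lock.readLock().unlock();
        }

        // 2. Probe the disks with NO lock held; this is the part that may stall.
        List<String> good = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String dir : all) {
            (isUsable(new File(dir)) ? good : failed).add(dir);
        }

        // 3. Publish the results under the write lock; only a reference swap.
        lock.writeLock().lock();
        try {
            goodDirs = good;
            failedDirs = failed;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Simplified, hypothetical stand-in for DiskChecker.checkDir. */
    static boolean isUsable(File dir) {
        return (dir.isDirectory() || dir.mkdirs()) && dir.canRead() && dir.canWrite();
    }
}
```

With this layout, a heartbeat thread calling getFailedDirs() only contends with the brief list swap at the end of checkDirs(). The sketch assumes a single monitor thread calls checkDirs(); concurrent callers would need additional coordination. Note that, per the comment above, this keeps heartbeats flowing but does not by itself make a node with stalled disks look unhealthy.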
