From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Date Fri, 17 Jun 2016 04:59:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15335408#comment-15335408 ]

Wangda Tan commented on YARN-5214:

Thanks [~djp],

I think the RW lock added by the patch can generally reduce the time spent on locking. However,
I don't think it can solve the entire problem.

Per my understanding, even after the R/W lock changes, when anything bad happens on the disks,
DirectoryCollection will be stuck under the write lock, so NodeStatusUpdater will be blocked as well.
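
To make that concrete, here is a minimal sketch (simplified names, not the actual DirectoryCollection code) of why a plain R/W lock still leaves the heartbeat path exposed: any reader of the failed-dirs list waits for the full duration of a slow checkDirs():

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for DirectoryCollection, not the real class.
class DirCollectionSketch {
  private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
  private final List<String> errorDirs = new ArrayList<>();

  void checkDirs() {
    rwLock.writeLock().lock();
    try {
      // Disk probes (mkdir, read/write checks) happen here. On a node with a
      // bad or saturated disk this can stall for a long time, and the write
      // lock is held the whole time.
      probeDisksSlowly();
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  List<String> getFailedDirs() {
    rwLock.readLock().lock(); // blocks until checkDirs() drops the write lock
    try {
      return new ArrayList<>(errorDirs);
    } finally {
      rwLock.readLock().unlock();
    }
  }

  private void probeDisksSlowly() {
    // placeholder for DiskChecker-style probes against each local dir
  }
}
{code}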

I think there are two fixes we can make to tackle the problem:
1) In the short term, errorDirs/fullDirs/localDirs are copy-on-write lists, so we don't need to
acquire the lock in getGoodDirs/getFailedDirs/getFullDirs (sketched after this list). This could
lead to inconsistent data in rare cases, but I think it is generally safe, and any inconsistent
data will be corrected by the next heartbeat.

2) In the longer term, we may need to treat a DirectoryCollection stuck under busy IO as an
unhealthy state; NodeStatusUpdater should be able to report such a status to the RM, so the RM
will avoid allocating any new containers to such nodes. [~nroberts] suggested the same thing.
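
Here is a minimal sketch of fix (1), with simplified names (CopyOnWriteArrayList stands in for whatever copy-on-write structure the real lists use): the getters become lock-free, at the cost of a possibly stale or briefly inconsistent snapshot that the next heartbeat corrects:

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified stand-in, not the real DirectoryCollection.
class LockFreeDirsSketch {
  private final List<String> localDirs = new CopyOnWriteArrayList<>();
  private final List<String> errorDirs = new CopyOnWriteArrayList<>();
  private final List<String> fullDirs = new CopyOnWriteArrayList<>();

  // No lock taken: each list is copy-on-write, so readers iterate over an
  // immutable snapshot and can never block behind checkDirs().
  List<String> getGoodDirs() {
    return Collections.unmodifiableList(localDirs);
  }

  List<String> getFailedDirs() {
    return Collections.unmodifiableList(errorDirs);
  }

  List<String> getFullDirs() {
    return Collections.unmodifiableList(fullDirs);
  }

  // The writer (checkDirs) still mutates the lists under its own lock; a
  // reader may observe the three lists mid-update, i.e. briefly inconsistent
  // with each other, which the next heartbeat corrects.
}
{code}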


> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-5214.patch
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped and, after a
> while, the node was marked LOST by the RM. From the log, the NM daemon is still running, but
> jstack shows that the NM's NodeStatusUpdater thread is blocked:
> 1. The Node Status Updater thread is blocked on lock 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting
for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
>         - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd
runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createDirectory(Native Method)
>         at java.io.File.mkdir(File.java:1316)
>         at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
>         - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> This disk operation could take longer than expected, especially under high IO throughput,
> and we should have a fine-grained lock for the related operations here.
> The same issue was raised and fixed on the HDFS side in HDFS-7489, and we should probably
> have a similar fix here.
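
For reference, a minimal sketch of the fine-grained locking idea along the lines of HDFS-7489 (method names here are hypothetical, not taken from any patch): run the slow disk probes without holding the lock, then take the lock only for the short in-memory update:

{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in; method names are hypothetical.
class FineGrainedCheckSketch {
  private final Object lock = new Object();
  private List<String> errorDirs = new ArrayList<>();

  void checkDirs(List<String> allDirs) {
    // Phase 1: slow disk IO with no lock held, so heartbeat readers are
    // never stuck behind disk latency.
    List<String> newErrors = new ArrayList<>();
    for (String dir : allDirs) {
      if (!probeDir(dir)) {
        newErrors.add(dir);
      }
    }
    // Phase 2: short critical section that only publishes the results.
    synchronized (lock) {
      errorDirs = newErrors;
    }
  }

  List<String> getFailedDirs() {
    synchronized (lock) {
      return new ArrayList<>(errorDirs);
    }
  }

  private boolean probeDir(String dir) {
    return new File(dir).canWrite(); // stand-in for DiskChecker.checkDir
  }
}
{code}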
