hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
Date Tue, 21 Jun 2016 18:55:57 GMT

    [ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342451#comment-15342451 ]

Junping Du commented on YARN-5214:
----------------------------------

Thanks [~leftnoteasy] and [~vinodkv] for review and comments!
bq. dirsChangeListeners doesn't change except on service-start and stop. So no need to grab the global / read / write lock, we can simply make it use a thread-safe collection?
That's a good point. I will get rid of the lock for dirsChangeListeners in the next patch.
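Roughly along these lines; this is only a minimal sketch with made-up class/method names, not the actual DirectoryCollection code or the final patch:
{code:java}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: keep the listeners in a concurrent set so that
// register/deregister/notify need no explicit lock at all.
class DirsChangeListenerRegistry {
  interface DirsChangeListener {
    void onDirsChanged();
  }

  private final Set<DirsChangeListener> dirsChangeListeners =
      Collections.newSetFromMap(new ConcurrentHashMap<DirsChangeListener, Boolean>());

  void register(DirsChangeListener listener) {
    // add() is atomic; notify the new listener once on registration.
    if (dirsChangeListeners.add(listener)) {
      listener.onDirsChanged();
    }
  }

  void deregister(DirsChangeListener listener) {
    dirsChangeListeners.remove(listener);
  }

  void notifyDirsChanged() {
    // Weakly consistent iteration is safe under concurrent add/remove.
    for (DirsChangeListener listener : dirsChangeListeners) {
      listener.onDirsChanged();
    }
  }
}
{code}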

bq. In createNonExistentDirs(), you don't need to make a manual copy of localDirs, the iterator() method already does it for you.
I would like to get rid of CopyOnWriteArrayList and replace it with a normal ArrayList, given that we already have a read/write lock on the related dirs. In createNonExistentDirs(), explicitly copying localDirs under a read lock and keeping createDir() lock-free seems clearer. Doesn't it?
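Something like the following sketch, with illustrative class/field names and a simplified createDir() (not the actual patch): copy under the read lock, then do the slow directory creation with no lock held.
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: a plain ArrayList guarded by the read/write lock;
// the snapshot keeps directory creation out of the critical section.
class CreateNonExistentDirsSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private final List<String> localDirs = new ArrayList<String>();

  void createNonExistentDirs() {
    // Take a consistent snapshot of localDirs under the read lock...
    List<String> dirs;
    lock.readLock().lock();
    try {
      dirs = new ArrayList<String>(localDirs);
    } finally {
      lock.readLock().unlock();
    }
    // ...then create the directories without holding any lock.
    for (String dir : dirs) {
      createDir(dir);
    }
  }

  private void createDir(String dir) {
    // Placeholder for the real creation/permission logic.
    new File(dir).mkdirs();
  }
}
{code}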

bq. Can you not use the java8 stuff (diamond operator etc), so that this patch can be backported to the older releases?
The diamond operator has been supported since Java 7...
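For example, this compiles fine with a Java 7 compiler:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DiamondOperatorExample {
  // The diamond operator lets the compiler infer the type arguments;
  // it has been part of the language since Java 7.
  List<String> dirs = new ArrayList<>();
  Map<String, List<String>> errors = new HashMap<>();
}
{code}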

> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-5214.patch
>
>
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped, and after a while the NM was marked LOST by the RM. From the log, the NM daemon was still running, but jstack shows that the NM's NodeStatusUpdater thread is blocked:
> 1. The Node Status Updater thread is blocked waiting to lock 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting
for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
>         - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is DiskHealthMonitor:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd
runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createDirectory(Native Method)
>         at java.io.File.mkdir(File.java:1316)
>         at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
>         - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> This disk operation can take longer than expected, especially under high I/O throughput, so we should use fine-grained locking for the related operations here.
> The same issue was raised and fixed for HDFS in HDFS-7489, and we should probably have a similar fix here.
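For reference, a rough sketch of that fine-grained direction, with illustrative names and much-simplified state (not the actual DirectoryCollection code or the patch): snapshot the dirs under a read lock, run the slow disk probes with no lock held, and take the write lock only briefly to publish the result, so readers such as getFailedDirs() never wait on disk I/O.
{code:java}
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only: disk probing happens outside any lock; the result is
// published under a short write-lock section, so heartbeat-path readers
// never block behind slow mkdir/IO calls.
class DirectoryCollectionSketch {
  private final ReadWriteLock lock = new ReentrantReadWriteLock();
  private List<String> localDirs = new ArrayList<String>();
  private List<String> failedDirs = new ArrayList<String>();

  List<String> getFailedDirs() {
    lock.readLock().lock();
    try {
      // Readers only wait for this copy, never for the disk checks.
      return new ArrayList<String>(failedDirs);
    } finally {
      lock.readLock().unlock();
    }
  }

  void checkDirs() {
    // 1. Snapshot the dirs to test under the read lock.
    List<String> dirsToTest;
    lock.readLock().lock();
    try {
      dirsToTest = new ArrayList<String>(localDirs);
    } finally {
      lock.readLock().unlock();
    }

    // 2. Slow part: probe the disks with no lock held.
    List<String> newFailed = new ArrayList<String>();
    for (String dir : dirsToTest) {
      File f = new File(dir);
      if (!(f.isDirectory() && f.canWrite())) {
        newFailed.add(dir);
      }
    }

    // 3. Publish the result under a brief write-lock section.
    lock.writeLock().lock();
    try {
      failedDirs = newFailed;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}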





