Date: Wed, 8 Jun 2016 20:04:21 +0000 (UTC)
From: "Nathan Roberts (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5214) Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater

    [ https://issues.apache.org/jira/browse/YARN-5214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15321342#comment-15321342 ]

Nathan Roberts commented on YARN-5214:
--------------------------------------

I'm not suggesting this change shouldn't be made, but keep in mind that if the NM is having trouble performing this type of action within the timeout
(10 minutes or so), then the node is not very healthy and probably shouldn't be given anything more to run until the situation improves. It's going to have trouble doing all sorts of other things as well, so having it look unhealthy in some fashion isn't all bad. If we somehow keep heartbeats completely free of I/O, then the RM will keep assigning containers that will likely run into exactly the same slowness.

We used to see similar issues that we resolved by switching to the deadline I/O scheduler (assuming Linux). See https://issues.apache.org/jira/browse/HDFS-9239?focusedCommentId=15218302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15218302

> Pending on synchronized method DirectoryCollection#checkDirs can hang NM's NodeStatusUpdater
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-5214
>                 URL: https://issues.apache.org/jira/browse/YARN-5214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> In one cluster, we noticed that the NM's heartbeat to the RM suddenly stopped, and after a while the NM was marked LOST by the RM. From the log, the NM daemon is still running, but jstack shows that the NM's NodeStatusUpdater thread is blocked:
> 1. The Node Status Updater thread is blocked waiting for monitor 0x000000008065eae8:
> {noformat}
> "Node Status Updater" #191 prio=5 os_prio=0 tid=0x00007f0354194000 nid=0x26fa waiting for monitor entry [0x00007f035945a000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.getFailedDirs(DirectoryCollection.java:170)
>         - waiting to lock <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.getDisksHealthReport(LocalDirsHandlerService.java:287)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeHealthCheckerService.getHealthReport(NodeHealthCheckerService.java:58)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:83)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:643)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> 2. The actual holder of this lock is the DiskHealthMonitor thread:
> {noformat}
> "DiskHealthMonitor-Timer" #132 daemon prio=5 os_prio=0 tid=0x00007f0397393000 nid=0x26bd runnable [0x00007f035e511000]
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createDirectory(Native Method)
>         at java.io.File.mkdir(File.java:1316)
>         at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsCheck(DiskChecker.java:67)
>         at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:104)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.verifyDirUsingMkdir(DirectoryCollection.java:340)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.testDirs(DirectoryCollection.java:312)
>         at org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection.checkDirs(DirectoryCollection.java:231)
>         - locked <0x000000008065eae8> (a org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.checkDirs(LocalDirsHandlerService.java:389)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService.access$400(LocalDirsHandlerService.java:50)
>         at org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService$MonitoringTimerTask.run(LocalDirsHandlerService.java:122)
>         at java.util.TimerThread.mainLoop(Timer.java:555)
>         at java.util.TimerThread.run(Timer.java:505)
> {noformat}
> This disk operation can take longer than expected, especially under high I/O throughput, and we should use fine-grained locking for the related operations here.
> The same issue was raised and fixed for HDFS in HDFS-7489, and we should probably apply a similar fix here.
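
For illustration only, the fine-grained locking suggested in the issue description could look roughly like the sketch below. This is not the actual YARN patch: FineGrainedDirCollection, diskTest, and getGoodDirs are hypothetical stand-ins for DirectoryCollection and its disk probe. The only point is that the slow probe runs with no lock held, so a heartbeat-path reader such as getFailedDirs waits only for a brief list copy or swap, never for disk I/O.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Predicate;

/** Hypothetical stand-in for DirectoryCollection; a sketch, not the YARN code. */
class FineGrainedDirCollection {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private List<String> goodDirs;                        // guarded by lock
    private List<String> failedDirs = new ArrayList<>();  // guarded by lock
    // Assumed slow probe (for example an mkdir test); returns true if the dir is usable.
    private final Predicate<String> diskTest;

    FineGrainedDirCollection(List<String> dirs, Predicate<String> diskTest) {
        this.goodDirs = new ArrayList<>(dirs);
        this.diskTest = diskTest;
    }

    /** Heartbeat path: holds the read lock only long enough to copy the list. */
    List<String> getFailedDirs() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(new ArrayList<>(failedDirs));
        } finally {
            lock.readLock().unlock();
        }
    }

    List<String> getGoodDirs() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(new ArrayList<>(goodDirs));
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Disk-health timer path: the slow I/O happens outside any lock. */
    void checkDirs() {
        // Step 1: snapshot every known directory under the read lock.
        List<String> all;
        lock.readLock().lock();
        try {
            all = new ArrayList<>(goodDirs);
            all.addAll(failedDirs);
        } finally {
            lock.readLock().unlock();
        }

        // Step 2: probe each directory with no lock held (this may be slow).
        List<String> good = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String dir : all) {
            if (diskTest.test(dir)) {
                good.add(dir);
            } else {
                failed.add(dir);
            }
        }

        // Step 3: publish the results under the write lock (a quick swap).
        lock.writeLock().lock();
        try {
            goodDirs = good;
            failedDirs = failed;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

As the comment above points out, a design like this keeps heartbeats flowing even when the disks are slow, so the node's unhealthy state would have to be reported through the check results rather than through a stalled heartbeat.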