From: "Masatake Iwasaki (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Thu, 16 Jul 2015 23:39:06 +0000 (UTC)
Subject: [jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

    [ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630535#comment-14630535 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

bq. 2. Would you tell me why Client.getRpcTimeout returns 0 if ipc.client.ping is false?

Just to make it clear that the timeout has no effect without setting {{ipc.client.ping}} to true.

> NM does not failover timely if RM node network connection fails
> ---------------------------------------------------------------
>
>                 Key: YARN-2578
>                 URL: https://issues.apache.org/jira/browse/YARN-2578
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>        Attachments: YARN-2578.002.patch, YARN-2578.patch
>
> The NM does not fail over correctly when the network cable of the RM is unplugged, or when the failure is simulated by a "service network stop" or by a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected, as expected. The NM should then re-register with the new active RM, but this re-registration takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
>   node 1: ZK, NN, JN, ZKFC, DN, RM, NM
>   node 2: ZK, NN, JN, ZKFC, DN, RM, NM
>   node 3: ZK, JN, DN, NM
> - start all services and make sure they are in good health
> - kill the network connection of the active RM using one of the network kills from above
> - observe the NN and RM fail over
> - the DNs fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and the stack traces show no change at all
> The stack traces of the NM all show the same set of threads.
> The main thread that should drive the re-registration is the "Node Status Updater" thread. It is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x00007f5a6cc99800 nid=0x18d0 in Object.wait() [0x00007f5a51fc1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>         - locked <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out, and we should be using a version which takes the RPC timeout (from the configuration) as a parameter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
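A minimal sketch of the {{Client.getRpcTimeout}} behavior described in the comment above (an illustration only, not the actual Hadoop source; the timeout key name {{ipc.client.rpc-timeout.ms}} is assumed here for the example):

{code}
import org.apache.hadoop.conf.Configuration;

public final class RpcTimeoutSketch {
  // Assumed key name for the configurable RPC timeout in this sketch.
  private static final String RPC_TIMEOUT_KEY = "ipc.client.rpc-timeout.ms";

  /**
   * Returns the effective client RPC timeout in milliseconds.
   * Mirrors the behavior discussed above: when ipc.client.ping is false,
   * 0 is returned, making it explicit that the configured timeout only
   * takes effect when ping is enabled.
   */
  public static int getRpcTimeout(Configuration conf) {
    if (!conf.getBoolean("ipc.client.ping", true)) {
      return 0; // timeout has no effect without ipc.client.ping=true
    }
    return Math.max(0, conf.getInt(RPC_TIMEOUT_KEY, 0));
  }

  private RpcTimeoutSketch() {}
}
{code}

A proxy created with such a non-zero timeout would let the nodeHeartbeat call in the stack trace above fail fast, rather than leaving the "Node Status Updater" thread blocked in Client.call indefinitely.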