From: "Masatake Iwasaki (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Thu, 16 Jul 2015 23:39:06 +0000 (UTC)
Subject: [jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails

    [ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630535#comment-14630535 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

bq. 2. Would you tell me why Client.getRpcTimeout returns 0 if ipc.client.ping is false?

Just to make it clear that the timeout has no effect without setting {{ipc.client.ping}} to true.

> NM does not failover timely if RM node network connection fails
> ---------------------------------------------------------------
>
>                 Key: YARN-2578
>                 URL: https://issues.apache.org/jira/browse/YARN-2578
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>        Attachments: YARN-2578.002.patch, YARN-2578.patch
>
> The NM does not fail over correctly when the network cable of the RM is unplugged, or when the failure is simulated by a "service network stop" or by a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected, as expected. The NM should then re-register with the new active RM, but this re-registration takes a long time (15 minutes or more). Until then the cluster has no nodes for processing and applications are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
>   node 1: ZK, NN, JN, ZKFC, DN, RM, NM
>   node 2: ZK, NN, JN, ZKFC, DN, RM, NM
>   node 3: ZK, JN, DN, NM
> - start all services and make sure they are in good health
> - kill the network connection of the active RM using one of the network kills from above
> - observe the NN and RM fail over
> - the DNs fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and the stack traces show no change at all
> The stack traces of the NM all show the same set of threads.
> The main thread that should drive the re-registration is the "Node Status Updater" thread. It is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x00007f5a6cc99800 nid=0x18d0 in Object.wait() [0x00007f5a51fc1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>         - locked <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl. The generated proxy does not time out, and we should be using a version which takes the RPC timeout (from the configuration) as a parameter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
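A minimal sketch of the {{Client.getRpcTimeout}} behavior described in the comment above (an illustration only, not the actual Hadoop source; the timeout key name {{ipc.client.rpc-timeout.ms}} is assumed here for the example):

{code}
import org.apache.hadoop.conf.Configuration;

public final class RpcTimeoutSketch {
  // Assumed key name for the configurable RPC timeout in this sketch.
  private static final String RPC_TIMEOUT_KEY = "ipc.client.rpc-timeout.ms";

  /**
   * Returns the effective client RPC timeout in milliseconds.
   * Mirrors the behavior discussed above: when ipc.client.ping is false,
   * 0 is returned, making it explicit that the configured timeout only
   * takes effect when ping is enabled.
   */
  public static int getRpcTimeout(Configuration conf) {
    if (!conf.getBoolean("ipc.client.ping", true)) {
      return 0; // timeout has no effect without ipc.client.ping=true
    }
    return Math.max(0, conf.getInt(RPC_TIMEOUT_KEY, 0));
  }

  private RpcTimeoutSketch() {}
}
{code}

A proxy created with such a non-zero timeout would let the nodeHeartbeat call in the stack trace above fail fast, rather than leaving the "Node Status Updater" thread blocked in Client.call indefinitely.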