hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
Date Tue, 28 Apr 2015 02:22:07 GMT

    [ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516201#comment-14516201
] 

Hadoop QA commented on YARN-2578:
---------------------------------

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch |  14m 40s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author |   0m  0s | The patch does not contain any @author tags.
|
| {color:red}-1{color} | tests included |   0m  0s | The patch doesn't appear to include any
new or modified tests.  Please justify why no new tests are needed for this patch. Also please
list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | whitespace |   0m  0s | The patch has no lines that end in whitespace.
|
| {color:green}+1{color} | javac |   7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc |   9m 37s | There were no new javadoc warning messages.
|
| {color:green}+1{color} | release audit |   0m 22s | The applied patch does not increase
the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle |   5m 24s | The applied patch generated  1  additional
checkstyle issues. |
| {color:green}+1{color} | install |   1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse |   0m 33s | The patch built with eclipse:eclipse.
|
| {color:green}+1{color} | findbugs |   0m 52s | The patch does not introduce any new Findbugs
(version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests |   0m 25s | Tests passed in hadoop-yarn-server-common.
|
| | |  41m  9s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / feb68cb |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7518/artifact/patchprocess/checkstyle-result-diff.txt
|
| hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7518/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt
|
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7518/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep
3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7518/console |


This message was automatically generated.

> NM does not failover timely if RM node network connection fails
> ---------------------------------------------------------------
>
>                 Key: YARN-2578
>                 URL: https://issues.apache.org/jira/browse/YARN-2578
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.1
>            Reporter: Wilfred Spiegelenburg
>         Attachments: YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is unplugged or
the failure is simulated by a "service network stop" or a firewall that drops all traffic
on the node. The RM fails over to the standby node when the failure is detected as expected.
The NM should than re-register with the new active RM. This re-register takes a long time
(15 minutes or more). Until then the cluster has no nodes for processing and applications
are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
>     node 1: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 2: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 3: ZK, JN, DN, NM
> - start all services make sure they are in good health
> - kill the network connection of the RM that is active using one of the network kills
from above
> - observe the NN and RM failover
> - the DN's fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread which should
be used in the re-register is the "Node Status Updater" This thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x00007f5a6cc99800 nid=0x18d0 in Object.wait() [0x00007f5a51fc1000]
>    java.lang.Thread.State: WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
> 	at java.lang.Object.wait(Object.java:503)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1395)
> 	- locked <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
> 	at org.apache.hadoop.ipc.Client.call(Client.java:1362)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> 	at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
> 	at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the ResourceTrackerPBClientImpl.
The generated proxy does not time out and we should be using a version which takes the RPC
timeout (from the configuration) as a parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message