hadoop-hdfs-issues mailing list archives

From "Danil Serdyuchenko (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-13669) YARN in HA not failing over to a new resource manager.
Date Mon, 11 Jun 2018 12:47:00 GMT
Danil Serdyuchenko created HDFS-13669:
-----------------------------------------

             Summary: YARN in HA not failing over to a new resource manager.
                 Key: HDFS-13669
                 URL: https://issues.apache.org/jira/browse/HDFS-13669
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Danil Serdyuchenko


We are running YARN in HA mode with two RMs (rm1 and rm2). We hit an issue when recreating one of the RMs:
 # Recreated the standby RM (rm2), which gave it a new IP.
 # Stopped the active RM (rm1).
 # NMs tried to fail over to rm2, but timed out because they were still connecting to the old IP.
 # NMs reached the configured 30 failover retries and shut down.

We get the following logs.
{noformat}
18/06/06 15:36:32 WARN ipc.Client: Address change detected. Old: yarnrm2/x.x.x.x:8031 New:
yarnrm2/y.y.y.y:8031
18/06/06 15:36:32 INFO retry.RetryInvocationHandler: Exception while invoking nodeHeartbeat
of class ResourceTrackerPBClientImpl over rm2 after 25 fail over attempts. Trying to fail
over after sleeping for 37191ms.
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-a-a-a-a/a.a.a.a to yarnrm2:8031
failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=yarnrm2/x.x.x.x:8031]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy29.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:596)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting
for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending
remote=yarnrm2.grappler.eu-west-1.prod.aws.skyscanner.local/10.51.104.136:8031]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
        at org.apache.hadoop.ipc.Client.call(Client.java:1446)
        ... 12 more{noformat}
We get this exception and fail over back to rm1, repeating until all 30 failovers are used up:
{noformat}
18/06/06 15:39:44 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat
over rm1. Not retrying because failovers (30) exceeded maximum allowed (30){noformat}
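The failover budget seen in this log can be sketched roughly as below. This is only an illustration that mirrors the wording of the log message, not the actual RetryInvocationHandler code: the class FailoverBudgetSketch and the method invokeWithFailover are hypothetical names, and the sleep between failovers is left out. Starting on rm1 with a limit of 30, the loop alternates between the two RMs and gives up on rm1, which matches the "over rm1" in the log line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class FailoverBudgetSketch {

    /**
     * Tries the targets round-robin, counting each switch as one failover.
     * Returns the first target that accepts the call, or null once the
     * failover count has reached maxFailovers and the call fails again.
     */
    static String invokeWithFailover(List<String> targets,
                                     Predicate<String> call,
                                     int maxFailovers) {
        int failovers = 0;
        int index = 0;
        while (true) {
            String target = targets.get(index);
            if (call.test(target)) {
                return target;                 // call succeeded on this RM
            }
            if (failovers >= maxFailovers) {
                return null;                   // budget exhausted, stop retrying
            }
            failovers++;
            index = (index + 1) % targets.size();
        }
    }

    public static void main(String[] args) {
        // Both RMs unreachable: rm1 stopped, rm2 only reachable on a new IP.
        String result = invokeWithFailover(Arrays.asList("rm1", "rm2"), rm -> false, 30);
        System.out.println(result == null ? "gave up" : "connected to " + result);
    }
}
```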
From the logs it appears that the timeouts happen because the client is still trying to connect
to the old IP (x.x.x.x in the logs). Looking at the code of the Client class, after the updateAddress
method call we would expect a retry with the new server IP (the "Retrying connect to server ..."
log) up to ipc.client.connect.max.retries.on.timeouts times. However, we never see the retry logs
and the call just fails with the exception. That setting is at its default of 45 on all of our NMs.
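For reference, the behaviour we expected can be sketched as follows. This is a minimal illustration and not Hadoop's code: AddressRefreshSketch, Resolver and refresh are hypothetical names, and the lookup is passed in so the example runs without real DNS. It relies only on the fact that a resolved java.net.InetSocketAddress keeps the IP it was created with, so a client has to build a new instance to pick up a changed IP and then use that instance for the next connect attempt.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class AddressRefreshSketch {

    /** Hypothetical lookup hook so the sketch can run without real DNS. */
    interface Resolver {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /**
     * Re-resolves the host name of a cached address. Returns a new address
     * if the IP changed, otherwise the cached instance.
     */
    static InetSocketAddress refresh(InetSocketAddress cached, Resolver resolver) {
        try {
            InetAddress current = resolver.resolve(cached.getHostString());
            if (!current.equals(cached.getAddress())) {
                // Address change detected: later connects must use this one.
                return new InetSocketAddress(current, cached.getPort());
            }
        } catch (UnknownHostException e) {
            // Lookup failed: keep using the cached address.
        }
        return cached;
    }

    public static void main(String[] args) throws Exception {
        // Example IPs are made up; getByAddress performs no DNS lookup.
        InetAddress oldIp = InetAddress.getByAddress("yarnrm2", new byte[] {10, 0, 0, 1});
        InetAddress newIp = InetAddress.getByAddress("yarnrm2", new byte[] {10, 0, 0, 2});
        InetSocketAddress cached = new InetSocketAddress(oldIp, 8031);
        InetSocketAddress refreshed = refresh(cached, host -> newIp);
        System.out.println("Old: " + cached + " New: " + refreshed);
    }
}
```

If the connection attempt after such a refresh still targets the cached instance, it will keep timing out against the old IP, which is what the stack trace above suggests.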



