hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4288) NodeManager restart should keep retrying to register to RM while connection exception happens during RM failed over.
Date Mon, 26 Oct 2015 22:25:28 GMT

    [ https://issues.apache.org/jira/browse/YARN-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975226#comment-14975226 ]

Junping Du commented on YARN-4288:
----------------------------------

Thanks [~vinodkv] for the comments.
bq. Why isn't existing RMProxy framework taking care of this?
RMProxy is supposed to take care of this. However, RMProxy retries only on specific (known)
exceptions and fails directly on all others. In this case, the IOException that gets thrown
fails directly without any retry (in the non-HA case). That leaves us at some risk if other
exceptions can be thrown while the RM is down. For this particular case, I can add IOException
(other than RemoteException) to the exceptions that are handled, which sounds like an easy
fix.
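
To make the whitelist behaviour concrete, here is a minimal, self-contained sketch. It does not use the real RMProxy or org.apache.hadoop.io.retry classes: WhitelistRetry, shouldRetry and callWithRetry are hypothetical names, and the two whitelisted exception types are an assumption picked for illustration only.

```java
import java.io.EOFException;
import java.net.ConnectException;
import java.util.concurrent.Callable;

/** Hypothetical sketch of a whitelist-based retry decision; not the real RMProxy code. */
final class WhitelistRetry {

    /** Assumed whitelist: only these exception types (and subclasses) are retried. */
    static boolean shouldRetry(Exception e) {
        return e instanceof ConnectException || e instanceof EOFException;
    }

    /** Calls the action, retrying whitelisted failures up to maxAttempts times. */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Anything not on the whitelist fails directly, even on the first attempt.
                if (!shouldRetry(e) || attempt >= maxAttempts) {
                    throw e;
                }
            }
        }
    }
}
```

With this shape, a plain java.io.IOException such as the "Connection reset by peer" wrapper in the log is not on the list, so the first failure is propagated to the caller without any retry.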

bq. Why are we putting special code in NodeStatusUpdater? Shouldn't we use something in the
RMProxy framework? See ServerProxy for example that gets used by NMClients.
As I mentioned above, having a whitelist of exceptions to retry doesn't sound robust enough:
if we hit an exception we haven't seen before, we could skip the retry unintentionally, couldn't we?
Anyway, I could fix the problem by following the existing retry policy framework, but hopefully
we can improve the framework in another JIRA.
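
One way to picture the robustness concern is to invert the default: retry on anything unless it is known to be fatal. The sketch below is only an illustration of that idea, not a proposed patch and not part of the existing framework. RetryByDefault, isFatal and callWithRetry are hypothetical names, and the choice of fatal exception types is an assumption.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch: retry by default, fail fast only on known-fatal errors. */
final class RetryByDefault {

    /** Assumed deny list: failures that another attempt cannot fix. */
    static boolean isFatal(Exception e) {
        return e instanceof SecurityException || e instanceof IllegalArgumentException;
    }

    /** Retries any non-fatal failure with a fixed sleep, up to maxAttempts attempts. */
    static <T> T callWithRetry(Callable<T> action, int maxAttempts, long sleepMillis)
            throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                if (isFatal(e) || attempt >= maxAttempts) {
                    throw e;
                }
                // Unknown exceptions land here and are retried after a pause.
                Thread.sleep(sleepMillis);
            }
        }
    }
}
```

The trade-off is the mirror image of the whitelist: an exception nobody anticipated is retried rather than dropped, at the cost of possibly retrying a failure that can never succeed until the attempt limit is reached.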

bq. Just looked at YARN-4132 too, we should definitely see if we can merge these two together.
This is a bug where the NM doesn't retry in some cases. YARN-4132 is about a different problem:
the NM should retry for longer than a general RMProxy client, which is a more general improvement.
I think we'd better keep them separate. Thoughts?

> NodeManager restart should keep retrying to register to RM while connection exception
happens during RM failed over.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4288
>                 URL: https://issues.apache.org/jira/browse/YARN-4288
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4288.patch
>
>
> When the NM is restarted, NodeStatusUpdaterImpl tries to register with the RM over RPC, which
can throw exceptions if the RM is restarting at the same time, as the following log
shows:
> {noformat}
> 2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222))
- Unexpected error rebooting NodeStatusUpdater
> java.io.IOException: Failed on local exception: java.io.IOException: Connection reset
by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1473)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
> 2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager (NodeManager.java:run(307)) - Error
while rebooting NodeStatusUpdater.
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on
local exception: java.io.IOException: Connection reset by peer; Host Details : local host
is: "172.27.62.28"; destination host is: "172.27.62.57":8025;
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
> Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Connection
reset by peer; Host Details : local host is: "ebdp-ch2-172.27.62.28"; destination host is:
"172.27.62.57":8025;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1473)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
>         ... 1 more
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
> 2015-08-17 14:35:59,445 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
> 2015-08-17 14:35:59,547 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512))
- Applications still running : [application_1439417357296_45357, application_1439417357296_45403,
application_1439417357296_45355, application_1439417357296_45111, application_1439417357296_45452,
application_1439417357296_45350, application_1439417357296_45499, application_1439417357296_45205,
application_1439417357296_21009]
> 2015-08-17 14:35:59,548 INFO  ipc.Server (Server.java:stop(2469)) - Stopping server on
45454
> 2015-08-17 14:35:59,551 INFO  ipc.Server (Server.java:run(717)) - Stopping IPC Server
listener on 45454
> 2015-08-17 14:35:59,551 INFO  logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141))
- org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
waiting for pending aggregation during exit
> 2015-08-17 14:35:59,552 INFO  ipc.Server (Server.java:run(843)) - Stopping IPC Server
Responder
> {noformat}
> This makes the NM restart fail. We should have a simple fix that lets this registration
with the RM retry on connection failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
