hadoop-yarn-issues mailing list archives

From "Srikanth Sundarrajan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
Date Mon, 18 May 2015 04:02:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14547506#comment-14547506 ]

Srikanth Sundarrajan commented on YARN-3644:
--------------------------------------------

[~vinodkv], YARN-3644 is independent of this. In our setup we ran into this before we ran
into YARN-3646. With the default settings, the NM keeps retrying for about 30 minutes before
*attempting* to shut itself down. Is there an issue if this wait time is much longer, or even
infinite (for both HA and non-HA setups)? An orthogonal issue is that when the NM attempts to
shut itself down, it doesn't actually go down; it lingers for days without accepting any
containers unless it is restarted (I will file another issue for this).
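
To make the question concrete, here is a minimal sketch of the behaviour being asked about. It is not the actual NodeManager or RMProxy code: the class `ConnectRetrySketch`, the method `callWithConnectRetry`, and its parameters are hypothetical names used only for illustration. The assumption is that only `ConnectException` is retried, that a negative wait means "retry forever", and that any other failure is rethrown immediately rather than retried.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class ConnectRetrySketch {

    /**
     * Invokes the call, retrying only on ConnectException.
     *
     * @param maxWaitMs       total time to keep retrying; a negative value means retry forever
     * @param retryIntervalMs pause between attempts
     */
    static <T> T callWithConnectRetry(Callable<T> call, long maxWaitMs, long retryIntervalMs)
            throws Exception {
        final long startNanos = System.nanoTime();
        while (true) {
            try {
                return call.call();
            } catch (ConnectException e) {
                long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000L;
                // Give up only when a finite budget is configured and the next
                // attempt would fall outside it.
                if (maxWaitMs >= 0 && elapsedMs + retryIntervalMs > maxWaitMs) {
                    throw e;
                }
                Thread.sleep(retryIntervalMs);
            }
            // Any other exception type is not caught here, so it propagates
            // to the caller on the first failure instead of being retried.
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulated RM that refuses the first two connection attempts.
        String result = callWithConnectRetry(() -> {
            if (attempts[0]++ < 2) {
                throw new ConnectException("RM not reachable (simulated)");
            }
            return "registered";
        }, -1, 10);
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

With a policy along these lines, the caller would decide what to do when the budget is exhausted (for example, whether to shut down at all), and an unbounded wait would not cause unrelated failures to be retried.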

> Node manager shuts down if unable to connect with RM
> ----------------------------------------------------
>
>                 Key: YARN-3644
>                 URL: https://issues.apache.org/jira/browse/YARN-3644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Srikanth Sundarrajan
>
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>           } catch (ConnectException e) {
>             //catch and throw the exception if tried MAX wait time to connect RM
>             dispatcher.getEventHandler().handle(
>                 new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>             throw new YarnRuntimeException(e);
> {code}
> In large clusters, if the RM is down for maintenance for a longer period, all the NMs shut
> themselves down, requiring additional work to bring the NMs back up.
> Setting yarn.resourcemanager.connect.wait-ms to -1 has other side effects: non-connection
> failures are then retried indefinitely by all YarnClients (via RMProxy).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
