hadoop-yarn-issues mailing list archives

From "Naganarasimha G R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3644) Node manager shuts down if unable to connect with RM
Date Fri, 12 Jun 2015 07:25:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14583093#comment-14583093 ]

Naganarasimha G R commented on YARN-3644:

Hi [~raju.bairishetti],
IIUC, the intention of this jira is only to make the NM wait for the RM indefinitely, and hence we don't want
to set {{yarn.resourcemanager.connect.max-wait.ms}} to the FOREVER retry policy, which might
affect other clients connecting to the RM, right?
If so, I feel the overall approach is fine, except for the cosmetic comments below:

# The configuration name could change from {{yarn.nodemanager.shutdown.on.connection.failures}} to {{yarn.nodemanager.shutdown.on.RM.connection.failures}}.
Please correct the name and the description in yarn-default.xml as well.
# The test case introduces a new {{MyNodeStatusUpdater6}} whose only change is to return the new ResourceTracker
for the test case. This is becoming more and more duplicated code, since most of the other
overridden NodeStatusUpdater classes do the same thing. Can we bring
in a common NodeStatusUpdater class that accepts a ResourceTracker as a constructor parameter?
(Refactoring the other classes can be taken up in another jira if required.)
# {{MyResourceTracker8}} could extend {{MyResourceTracker5}} and just override the required
methods. I would also appreciate some documentation above these classes, so that
they are easier to reuse in future if required.
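
To make comments 2 and 3 concrete, here is a rough sketch of the shape I have in mind. It is not taken from the patch or from the existing test code: the {{ResourceTracker}} interface below is a simplified stand-in (the real protocol uses request/response objects), and {{BaseTestTracker}}, {{ShutdownTestTracker}} and {{ConfigurableNodeStatusUpdater}} are hypothetical names used only for illustration.

```java
import java.util.Objects;

// Simplified stand-in for the real ResourceTracker protocol (assumption:
// plain strings replace the real request/response objects).
interface ResourceTracker {
    String registerNodeManager(String nodeId);
    String nodeHeartbeat(String nodeId);
}

/** Hypothetical base test tracker: accepts every registration and heartbeat. */
class BaseTestTracker implements ResourceTracker {
    @Override
    public String registerNodeManager(String nodeId) {
        return "REGISTERED:" + nodeId;
    }

    @Override
    public String nodeHeartbeat(String nodeId) {
        return "NORMAL";
    }
}

/** Hypothetical variant: reuses the base and overrides only the heartbeat. */
class ShutdownTestTracker extends BaseTestTracker {
    @Override
    public String nodeHeartbeat(String nodeId) {
        return "SHUTDOWN";
    }
}

/**
 * Hypothetical common test updater: the tracker is injected through the
 * constructor, so a test no longer needs its own updater subclass.
 */
class ConfigurableNodeStatusUpdater {
    private final ResourceTracker tracker;

    ConfigurableNodeStatusUpdater(ResourceTracker tracker) {
        this.tracker = Objects.requireNonNull(tracker, "tracker");
    }

    ResourceTracker getTracker() {
        return tracker;
    }

    String heartbeat(String nodeId) {
        return tracker.nodeHeartbeat(nodeId);
    }
}
```

With something along these lines, each test would pick its behaviour by passing a different tracker, e.g. {{new ConfigurableNodeStatusUpdater(new ShutdownTestTracker())}}, instead of adding another numbered updater class.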

> Node manager shuts down if unable to connect with RM
> ----------------------------------------------------
>                 Key: YARN-3644
>                 URL: https://issues.apache.org/jira/browse/YARN-3644
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Srikanth Sundarrajan
>            Assignee: Raju Bairishetti
>         Attachments: YARN-3644.001.patch, YARN-3644.patch
> When NM is unable to connect to RM, NM shuts itself down.
> {code}
>           } catch (ConnectException e) {
>             //catch and throw the exception if tried MAX wait time to connect RM
>             dispatcher.getEventHandler().handle(
>                 new NodeManagerEvent(NodeManagerEventType.SHUTDOWN));
>             throw new YarnRuntimeException(e);
> {code}
> In large clusters, if RM is down for maintenance for longer period, all the NMs shuts
> themselves down, requiring additional work to bring up the NMs.
> Setting the yarn.resourcemanager.connect.wait-ms to -1 has other side effects, where
> non connection failures are being retried infinitely by all YarnClients (via RMProxy).
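
For readers of the description above, the behaviour under discussion can be sketched roughly as follows. This is only an illustration under assumptions, not the attached patch and not the RMProxy implementation: {{RmConnectRetry}}, its parameters and the {{shutdownAction}} callback are hypothetical, and the boolean stands in for the proposed shutdown-on-connection-failure setting. The sketch retries only on {{ConnectException}}, so other failures are not retried.

```java
import java.net.ConnectException;
import java.util.concurrent.Callable;

/** Hypothetical helper; not part of YARN or of the attached patch. */
final class RmConnectRetry {
    private RmConnectRetry() {
    }

    /**
     * Calls the RM and retries only when a ConnectException is thrown.
     * If shutdownOnFailure is true, gives up after maxAttempts, runs the
     * shutdown action and rethrows the exception. If it is false, keeps
     * retrying until the call succeeds. Any other exception propagates
     * immediately without a retry.
     */
    static <T> T call(Callable<T> rmCall, boolean shutdownOnFailure,
                      int maxAttempts, long retryIntervalMs,
                      Runnable shutdownAction) throws Exception {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return rmCall.call();
            } catch (ConnectException e) {
                if (shutdownOnFailure && attempt >= maxAttempts) {
                    // Old behaviour: the NM shuts itself down.
                    shutdownAction.run();
                    throw e;
                }
                // Otherwise wait and try to reach the RM again.
                Thread.sleep(retryIntervalMs);
            }
        }
    }
}
```

In this sketch, passing {{false}} for {{shutdownOnFailure}} makes the caller wait for the RM without a limit, while leaving the retry settings used by other clients untouched.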

This message was sent by Atlassian JIRA
