hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers
Date Fri, 10 Apr 2015 15:18:12 GMT

    [ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14489761#comment-14489761 ]

Junping Du commented on YARN-3474:
----------------------------------

Can we just set "yarn.resourcemanager.connect.max-wait.ms" to a value larger than the default
900 seconds? A YARN admin could forget to clear the flag, and in that case applications and
containers could stay pending forever. So whichever approach we take, we need a timeout here
(to guard against operator error).
For your concrete scenario, one interesting option is to let the admin extend the timeout
while the cluster is running, probably through a ZooKeeper znode because the RM is unavailable,
but that could bring extra configuration complexity.
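
The idea above of a timeout that is always finite but can be extended at runtime can be
sketched as follows. This is only an illustration under stated assumptions, not code from
YARN or from any patch on this issue: the class ExtendableDeadline and its methods are
hypothetical names, and how the new limit would reach the NM (for example via a znode) is
left out.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of YARN): a wait limit that an admin can
 * raise while the wait is in progress, but that is never unbounded.
 */
final class ExtendableDeadline {
    private final long startMs;          // when the NM lost contact with the RM
    private final AtomicLong maxWaitMs;  // current limit, e.g. 900_000 ms

    ExtendableDeadline(long startMs, long initialMaxWaitMs) {
        if (initialMaxWaitMs <= 0) {
            throw new IllegalArgumentException("max wait must be positive");
        }
        this.startMs = startMs;
        this.maxWaitMs = new AtomicLong(initialMaxWaitMs);
    }

    /** Raises the limit; a smaller value is ignored so the limit never shrinks. */
    void extendTo(long newMaxWaitMs) {
        maxWaitMs.accumulateAndGet(newMaxWaitMs, Math::max);
    }

    /** True once the elapsed time has reached the current limit. */
    boolean isExpired(long nowMs) {
        return nowMs - startMs >= maxWaitMs.get();
    }

    /** Milliseconds left before expiry, never negative. */
    long remainingMs(long nowMs) {
        return Math.max(0L, maxWaitMs.get() - (nowMs - startMs));
    }
}
```

Because the limit is always a concrete number, a forgotten extension still expires eventually,
which is the safety property asked for above.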

> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-3474
>                 URL: https://issues.apache.org/jira/browse/YARN-3474
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>
> When RM HA is enabled and the active RM shuts down, the standby RM becomes active and
> recovers apps and attempts. Apps are not affected.
> However, some failures or bugs can prevent both RMs from starting normally (e.g.
> [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or the RM cannot connect to ZK).
> In that case the NM kills the containers running on it once it has been unable to heartbeat
> with the RM for some time (the max retry time is 15 mins by default). Then all apps are killed.
> In a production cluster we might come across such cases, and fixing these bugs might take
> more than 15 mins. To keep apps from being affected and killed by the NM, a YARN admin
> could set a flag (a znode '/wait-rm-to-come-back/cluster-id' in our solution) to tell the NM
> to wait for the RM to come back and not kill running containers. After the bugs are fixed and
> the RM starts normally, the admin clears the flag.
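
The decision described in the issue text can be sketched as a small pure function. This is a
minimal illustration, not the reporter's patch: the class WaitForRmPolicy and its method names
are hypothetical, the znode root is taken from the description, and reading the flag from
ZooKeeper is left to the caller.

```java
/**
 * Hypothetical sketch (not the actual YARN-3474 patch): decide whether the NM
 * should kill its containers after losing contact with the RM.
 */
final class WaitForRmPolicy {
    // Root of the flag znode as given in the issue description.
    static final String FLAG_ROOT = "/wait-rm-to-come-back";

    private WaitForRmPolicy() {
    }

    /** Builds the per-cluster flag path, e.g. /wait-rm-to-come-back/cluster1. */
    static String flagPath(String clusterId) {
        if (clusterId == null || clusterId.isEmpty()) {
            throw new IllegalArgumentException("clusterId must be non-empty");
        }
        return FLAG_ROOT + "/" + clusterId;
    }

    /**
     * Containers are killed only when the retry window is exhausted AND the
     * admin has not set the wait flag.
     */
    static boolean shouldKillContainers(long elapsedSinceLastHeartbeatMs,
                                        long maxWaitMs,
                                        boolean waitFlagSet) {
        boolean retryWindowExhausted = elapsedSinceLastHeartbeatMs >= maxWaitMs;
        return retryWindowExhausted && !waitFlagSet;
    }
}
```

Note that this sketch has no upper bound while the flag is set, so a flag that is never
cleared would keep containers alive indefinitely.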



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
