hadoop-yarn-dev mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers
Date Mon, 04 May 2015 12:46:06 GMT

     [ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jun Gong resolved YARN-3474.
    Resolution: Invalid

> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>                 Key: YARN-3474
>                 URL: https://issues.apache.org/jira/browse/YARN-3474
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3474.01.patch
> When RM HA is enabled and the active RM shuts down, the standby RM becomes active and recovers
> apps and attempts. Apps are not affected.
> However, some failures or bugs can prevent both RMs from starting normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340],
> or the RM cannot connect to ZK reliably). In that case the NM kills the containers running on it
> once it has been unable to heartbeat to the RM for some time (the max retry time is 15 mins by
> default), and all apps are killed.
> In a production cluster we might hit such cases, and fixing the bugs might take more than
> 15 mins. To keep apps from being affected and killed by the NM, a YARN admin could set a flag
> (in our solution, the znode '/wait-rm-to-come-back/cluster-id') that tells the NM to wait for
> the RM to come back and not kill running containers. After the bugs are fixed and the RM
> starts normally, the admin clears the flag.
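
The decision described in the issue can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in YARN-3474.01.patch: the class NmWaitPolicy, the method shouldKillContainers, and the BooleanSupplier that stands in for "does the flag znode exist?" are hypothetical names, and a real NM would have to query ZooKeeper for the flag rather than read an in-memory boolean.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the proposed NM behaviour; all names are illustrative. */
final class NmWaitPolicy {
    /** Default max retry time mentioned in the issue description. */
    static final Duration DEFAULT_MAX_WAIT = Duration.ofMinutes(15);

    private final Duration maxWait;
    /** Stand-in for checking whether the admin's flag znode is present. */
    private final BooleanSupplier waitFlagSet;

    NmWaitPolicy(Duration maxWait, BooleanSupplier waitFlagSet) {
        this.maxWait = maxWait;
        this.waitFlagSet = waitFlagSet;
    }

    /**
     * Returns true if the NM should kill its running containers, given how
     * long it has been unable to heartbeat to the RM.
     */
    boolean shouldKillContainers(Duration sinceLastHeartbeat) {
        // Still inside the retry window: keep retrying, kill nothing.
        if (sinceLastHeartbeat.compareTo(maxWait) < 0) {
            return false;
        }
        // Retry window exhausted: kill only if the admin has not asked us to wait.
        return !waitFlagSet.getAsBoolean();
    }

    public static void main(String[] args) {
        AtomicBoolean flag = new AtomicBoolean(true); // admin has set the flag
        NmWaitPolicy policy = new NmWaitPolicy(DEFAULT_MAX_WAIT, flag::get);

        // RM unreachable for 20 minutes while the flag is set.
        System.out.println(policy.shouldKillContainers(Duration.ofMinutes(20)));

        // Admin clears the flag while the RM is still unreachable.
        flag.set(false);
        System.out.println(policy.shouldKillContainers(Duration.ofMinutes(20)));
    }
}
```

Passing the flag check in as a supplier keeps the sketch independent of any ZooKeeper client, and it means the flag is re-read on every call, so clearing it takes effect on the next check.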

This message was sent by Atlassian JIRA
