Mailing-List: contact yarn-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-dev@hadoop.apache.org
Date: Mon, 4 May 2015 12:46:06 +0000 (UTC)
From: "Jun Gong (JIRA)" <jira@apache.org>
To: yarn-dev@hadoop.apache.org
Message-ID: <JIRA.12820078.1428674636000.69683.1430743566595@Atlassian.JIRA>
In-Reply-To: <JIRA.12820078.1428674636000@Atlassian.JIRA>
References: <JIRA.12820078.1428674636000@Atlassian.JIRA>
 <JIRA.12820078.1428674636679@arcas>
Subject: [jira] [Resolved] (YARN-3474) Add a way to let NM wait RM to come
 back, not kill running containers
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Gong resolved YARN-3474.
----------------------------
    Resolution: Invalid

> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-3474
>                 URL: https://issues.apache.org/jira/browse/YARN-3474
>             Project: Hadoop YARN
>          Issue Type: New Feature
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3474.01.patch
>
>
> When RM HA is enabled and the active RM shuts down, the standby RM becomes active and recovers apps and attempts. Apps are not affected.
> However, some failures or bugs can prevent both RMs from starting normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or the RM cannot connect to ZK reliably). In that case the NM kills the containers running on it once it has been unable to heartbeat with the RM for some time (the max retry time is 15 mins by default), so all apps are killed.
> In a production cluster we might hit such cases, and fixing these bugs might take more than 15 mins. To keep apps from being affected and killed by the NM, a YARN admin could set a flag (in our solution, the znode '/wait-rm-to-come-back/cluster-id') that tells the NM to wait for the RM to come back and not kill running containers. After the bugs are fixed and the RM starts normally, the admin clears the flag.
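
The decision described above can be modelled roughly as follows. This is only a sketch of the proposed behaviour, not the NodeManager code or the attached patch: the function names, the `flag_is_set` callback (standing in for a ZooKeeper existence check on the znode), and the choice to consult the flag only after the retry window has expired are all assumptions made for illustration.

```python
from typing import Callable

# Default max retry time quoted in the issue description (15 mins).
MAX_RETRY_SECONDS = 15 * 60


def flag_path(cluster_id: str) -> str:
    """Build the flag znode path for a cluster (hypothetical helper)."""
    return f"/wait-rm-to-come-back/{cluster_id}"


def should_kill_containers(
    seconds_without_rm: float,
    flag_is_set: Callable[[], bool],
    max_retry_seconds: float = MAX_RETRY_SECONDS,
) -> bool:
    """Decide whether the NM should kill its running containers.

    flag_is_set is a stand-in for "does the flag znode exist?"; a real
    implementation would ask ZooKeeper instead of calling a Python function.
    """
    # Still inside the retry window: keep heartbeating, kill nothing.
    if seconds_without_rm < max_retry_seconds:
        return False
    # Retry window exhausted: kill only if the admin has not set the flag.
    return not flag_is_set()
```

With the flag set, an NM that has been cut off for longer than the retry window keeps its containers; once the admin clears the flag, the usual behaviour applies again.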

