Date: Mon, 4 May 2015 12:46:06 +0000 (UTC)
From: "Jun Gong (JIRA)"
To: yarn-dev@hadoop.apache.org
Reply-To: yarn-dev@hadoop.apache.org
Subject: [jira] [Resolved] (YARN-3474) Add a way to let NM wait RM to come back, not kill running containers

     [ https://issues.apache.org/jira/browse/YARN-3474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Gong resolved YARN-3474.
----------------------------
    Resolution: Invalid

> Add a way to let NM wait RM to come back, not kill running containers
> ---------------------------------------------------------------------
>
>                 Key: YARN-3474
>                 URL: https://issues.apache.org/jira/browse/YARN-3474
>             Project: Hadoop YARN
>          Issue Type: New Feature
>   Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3474.01.patch
>
>
> When RM HA is enabled and the active RM shuts down, the standby RM becomes active and recovers apps and attempts. Apps are not affected.
> However, some cases or bugs can prevent both RMs from starting normally (e.g. [YARN-2340|https://issues.apache.org/jira/browse/YARN-2340], or the RM cannot connect to ZK reliably). The NM kills the containers running on it when it cannot heartbeat with the RM for some time (the max retry time is 15 mins by default). All apps are then killed.
> In a production cluster, we might run into the cases above, and fixing these bugs might take more than 15 mins. To keep apps from being affected and killed by the NM, a YARN admin could set a flag (in our solution, the flag is a znode '/wait-rm-to-come-back/cluster-id') that tells the NM to wait for the RM to come back and not kill running containers. After the bugs are fixed and the RM starts normally, the admin clears the flag.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
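
For readers of the archive, the decision described in the issue can be sketched roughly as follows. This is only an illustration and is not taken from YARN-3474.01.patch: the class `RmWaitPolicy`, the `Action` enum, and the methods `flagPath` and `decide` are hypothetical names, the ZooKeeper lookup is abstracted behind a `BooleanSupplier` instead of a real client call, and reading `cluster-id` in the znode path as a per-cluster placeholder is an assumption. Only the znode prefix and the 15-minute default come from the issue text.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the NM-side decision; not the code from the attached patch. */
final class RmWaitPolicy {

    /** What the NM would do after a failed heartbeat attempt. */
    enum Action { KEEP_RETRYING, WAIT_FOR_RM, KILL_CONTAINERS }

    /** Default max retry time quoted in the issue (15 minutes). */
    static final Duration DEFAULT_MAX_RETRY = Duration.ofMinutes(15);

    /** Builds the flag znode path; treating "cluster-id" as a placeholder is an assumption. */
    static String flagPath(String clusterId) {
        return "/wait-rm-to-come-back/" + clusterId;
    }

    /**
     * Decides what to do given the time since the last successful heartbeat.
     * waitFlagSet stands in for "does the flag znode exist?" and is only
     * consulted once the retry window has been exhausted.
     */
    static Action decide(Duration sinceLastHeartbeat,
                         Duration maxRetry,
                         BooleanSupplier waitFlagSet) {
        if (sinceLastHeartbeat.compareTo(maxRetry) < 0) {
            // Still inside the normal retry window: nothing special to do.
            return Action.KEEP_RETRYING;
        }
        // Retry window exhausted: the admin flag decides between waiting and killing.
        return waitFlagSet.getAsBoolean() ? Action.WAIT_FOR_RM : Action.KILL_CONTAINERS;
    }
}
```

With the flag absent, this sketch reproduces the behaviour the issue describes (containers are killed once the retry window expires). With the flag set, the NM would keep waiting until the admin clears it.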