Date: Tue, 1 Dec 2015 19:18:11 +0000 (UTC)
From: "Daniel Templeton (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-4401) A failed app recovery should not prevent the RM from starting

     [ https://issues.apache.org/jira/browse/YARN-4401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Templeton updated YARN-4401:
-----------------------------------
    Attachment: YARN-4401.001.patch

Here's the basic idea of what I'm proposing.
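
The attached patch is not reproduced in this message. As a rough illustration of the behavior the issue description below asks for (log the failure and skip the application rather than abort startup), a minimal Java sketch might look like the following. None of the names here (`RecoverySketch`, `AppRecoverer`, `recoverAll`) come from YARN or from the patch; they are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch; not YARN code and not the contents of YARN-4401.001.patch. */
class RecoverySketch {
    private static final Logger LOG = Logger.getLogger(RecoverySketch.class.getName());

    /** Hypothetical stand-in for whatever restores a single application's state. */
    interface AppRecoverer {
        void recover(String appId) throws Exception;
    }

    /**
     * Attempts to recover every application. A failure for one application is
     * logged and that application is skipped, so the loop always runs to the end.
     *
     * @return the IDs of the applications that could not be recovered
     */
    static List<String> recoverAll(List<String> appIds, AppRecoverer recoverer) {
        List<String> skipped = new ArrayList<>();
        for (String appId : appIds) {
            try {
                recoverer.recover(appId);
            } catch (Exception e) {
                // Log and move on instead of letting the exception abort startup.
                LOG.log(Level.SEVERE, "Failed to recover application " + appId + "; skipping it", e);
                skipped.add(appId);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> apps = Arrays.asList("app_1", "app_2", "app_3");
        // Simulate one application whose stored state cannot be recovered.
        List<String> skipped = recoverAll(apps, appId -> {
            if (appId.equals("app_2")) {
                throw new IllegalStateException("corrupt state for " + appId);
            }
        });
        System.out.println("Startup continues; skipped applications: " + skipped);
    }
}
```

The design choice in this sketch is to catch per application rather than around the whole loop, so a single bad entry cannot stop the remaining applications from being recovered. Returning the skipped IDs gives the caller something to report after startup.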

> A failed app recovery should not prevent the RM from starting
> -------------------------------------------------------------
>
>                 Key: YARN-4401
>                 URL: https://issues.apache.org/jira/browse/YARN-4401
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4401.001.patch
>
>
> An app recovery can fail with an exception for many different reasons, and when it does, the RM start is aborted. Presumably, the reason the RM is trying to do a recovery is that it's the standby trying to fill in for the active. Failing to come up defeats the purpose of the HA configuration. Instead of preventing the RM from starting, a failed app recovery should log an error and skip the application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)