Date: Sat, 1 Nov 2014 20:17:35 +0000 (UTC)
From: "Jian He (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-2010) Handle app-recovery failures gracefully

    [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14193443#comment-14193443 ]

Jian He commented on YARN-2010:
-------------------------------

bq. The comments elaborate on potential reasons for ConnectException. The stack trace corresponding to one instance is here -

The stack trace appears to show that renewHdfsToken is failing on submission. (Supposedly, the DFS client should handle retries in case of ConnectException; why doesn't it?) IIUC, this {{RMAppManager#recoverApplication}} code path is not doing any ZK operation.
If we can handle the exception coming out of {{getDelegationTokenRenewer().addApplicationSync}} properly, then for the purpose of this jira we don't need the following change, since the problem of {{Any subsequent attempts to transition the RM to active fail because RMActiveServices is not INITED, as in the Standby case}} is already fixed in YARN-2588:
{code}
// Unable to connect to HDFS or ZK. Assuming this is a transient
// issue, we should gracefully shutdown or transition to standby. If
// the issue is permanent, there is not much YARN can do.
rmContext.getDispatcher().getEventHandler().handle(
    new RMFatalEvent(RMFatalEventType.CONNECTION_FAILED, ce));
{code}

Also, the patch changes the behavior introduced by YARN-2308. YARN-2308 forces the RM to exit if the queue is missing, which prompts the admin to configure the queue properly. The patch instead moves all apps belonging to the missing queue to the FAILED state. Shouldn't we leave that behavior unchanged?

> Handle app-recovery failures gracefully
> ---------------------------------------
>
> Key: YARN-2010
> URL: https://issues.apache.org/jira/browse/YARN-2010
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.3.0
> Reporter: bc Wong
> Assignee: Karthik Kambatla
> Priority: Blocker
> Attachments: YARN-2010.1.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch, yarn-2010-7.patch, yarn-2010-8.patch
>
>
> Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1) transient, (2) specific to one application, and (3) permanent and apply to multiple (all) applications. Today, the RM fails to transition to Active and ends up in STOPPED state and can never be transitioned to Active again.
> The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)