Return-Path: X-Original-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Delivered-To: apmail-hadoop-yarn-issues-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8DF6F17ADF for ; Wed, 5 Nov 2014 15:15:39 +0000 (UTC) Received: (qmail 6286 invoked by uid 500); 5 Nov 2014 15:15:39 -0000 Delivered-To: apmail-hadoop-yarn-issues-archive@hadoop.apache.org Received: (qmail 6246 invoked by uid 500); 5 Nov 2014 15:15:39 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: yarn-issues@hadoop.apache.org Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 6233 invoked by uid 99); 5 Nov 2014 15:15:39 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 05 Nov 2014 15:15:39 +0000 Date: Wed, 5 Nov 2014 15:15:39 +0000 (UTC) From: "Hudson (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (YARN-2010) Handle app-recovery failures gracefully MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/YARN-2010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198487#comment-14198487 ] Hudson commented on YARN-2010: ------------------------------ FAILURE: Integrated in Hadoop-Mapreduce-trunk #1948 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1948/]) YARN-2010. Handle app-recovery failures gracefully. (Jian He and Karthik Kambatla via kasha) (kasha: rev b2cd2698028118b6384904732dbf94942f644732) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/TestRMAppTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMAppManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppRecoverEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueNotFoundException.java > Handle app-recovery failures gracefully > --------------------------------------- > > Key: YARN-2010 > URL: https://issues.apache.org/jira/browse/YARN-2010 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 2.3.0 > Reporter: bc Wong > Assignee: Karthik Kambatla > Priority: Blocker > Fix For: 2.6.0 > > Attachments: YARN-2010.1.patch, YARN-2010.12.patch, YARN-2010.12.patch, YARN-2010.14.patch, YARN-2010.15.patch, YARN-2010.16.patch, YARN-2010.patch, issue-stacktrace.rtf, yarn-2010-10.patch, yarn-2010-11.patch, yarn-2010-13.patch, yarn-2010-2.patch, yarn-2010-3.patch, yarn-2010-3.patch, yarn-2010-4.patch, yarn-2010-5.patch, yarn-2010-6.patch, yarn-2010-7.patch, yarn-2010-8.patch, yarn-2010-9.patch > > > Sometimes, the RM fails to recover an application. It could be because of turning security on, token expiry, or issues connecting to HDFS etc. The causes could be classified into (1) transient, (2) specific to one application, and (3) permanent and apply to multiple (all) applications. Today, the RM fails to transition to Active and ends up in STOPPED state and can never be transitioned to Active again. > The initial stacktrace reported is at https://issues.apache.org/jira/secure/attachment/12676476/issue-stacktrace.rtf -- This message was sent by Atlassian JIRA (v6.3.4#6332)