Date: Fri, 30 Nov 2012 11:41:58 +0000 (UTC)
From: "Bikas Saha (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-218) Distiguish between "failed" and "killed" app attempts

    [ https://issues.apache.org/jira/browse/YARN-218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13507278#comment-13507278 ]

Bikas Saha commented on YARN-218:
---------------------------------

Comment from YARN-230
https://issues.apache.org/jira/browse/YARN-230?focusedCommentId=13505427&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13505427

bq. We also need to change the AM retry default to > 1.
Otherwise, even with RM restart enabled, the restarted attempts will fail because the previous AM will delete job files. What is your suggestion for that?

bq. I think this is where the killed/failed distinction comes in. If the app attempt was killed (because the RM died), then the app will be retried since the first attempt didn't count (from the point of view of yarn.resourcemanager.am.max-retries). This should be taken care of in YARN-218 - does that sound OK to you?

This would mean that the RM needs to notify the AM when it is on its last retry. Currently, the AM reads the retry limit from config and makes that decision independently. This is a problem because if am.retries is set to 1, then even if the RM does not consider the last attempt a failure, the AM itself will clean up job data because it thinks it's the last retry.

> Distiguish between "failed" and "killed" app attempts
> -----------------------------------------------------
>
>                 Key: YARN-218
>                 URL: https://issues.apache.org/jira/browse/YARN-218
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>            Reporter: Tom White
>            Assignee: Tom White
>
> A "failed" app attempt is one that failed due to an error in the user program, as opposed to one that was "killed" by the system. Like in MapReduce task attempts, we should distinguish the two so that killed attempts do not count against the number of retries (yarn.resourcemanager.am.max-retries).
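
To make the mismatch described in the comment concrete, here is a minimal Java sketch. It is not taken from the YARN codebase: the class `AttemptRetrySketch`, the `Outcome` enum, and both methods are hypothetical names invented for illustration. It assumes the RM charges only "failed" attempts against the retry limit, while the AM compares its own attempt number with the configured limit.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names exist in YARN. */
public class AttemptRetrySketch {

    /** How an app attempt ended: user-program error vs. killed by the system. */
    public enum Outcome { FAILED, KILLED }

    /** Counts only FAILED attempts; KILLED attempts are not charged. */
    public static int countedFailures(List<Outcome> outcomes) {
        int failures = 0;
        for (Outcome outcome : outcomes) {
            if (outcome == Outcome.FAILED) {
                failures++;
            }
        }
        return failures;
    }

    /** Assumed RM-side view: no more attempts once counted failures reach the limit. */
    public static boolean rmConsidersFinal(List<Outcome> outcomesSoFar, int maxRetries) {
        return countedFailures(outcomesSoFar) >= maxRetries;
    }

    /** Assumed AM-side view: decided from config and the AM's own attempt number. */
    public static boolean amConsidersFinal(int attemptNumber, int maxRetries) {
        return attemptNumber >= maxRetries;
    }

    public static void main(String[] args) {
        int maxRetries = 1; // the retry limit set to 1, as in the comment
        // First attempt was killed, for example because the RM died.
        List<Outcome> outcomes = Arrays.asList(Outcome.KILLED);

        System.out.println("AM thinks it is the last retry: "
                + amConsidersFinal(1, maxRetries));
        System.out.println("RM thinks no retries remain:    "
                + rmConsidersFinal(outcomes, maxRetries));
    }
}
```

With a limit of 1 and a killed first attempt, the AM-side check returns true while the RM-side check returns false. That disagreement is the case in which the AM would clean up job data even though the RM is about to launch another attempt.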