Date: Thu, 17 Oct 2013 21:53:44 +0000 (UTC)
From: "Omkar Vinit Joshi (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-1210) During RM restart, RM should start a new attempt only when previous attempt exits for real

    [ https://issues.apache.org/jira/browse/YARN-1210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798485#comment-13798485 ]

Omkar Vinit Joshi commented on YARN-1210:
-----------------------------------------

taking it over.

> During RM restart, RM should start a new attempt only when previous attempt exits for real
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-1210
>                 URL: https://issues.apache.org/jira/browse/YARN-1210
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>
> When the RM recovers, it can wait for existing AMs to contact the RM back and then kill them forcefully before even starting a new AM. Worst case, the RM will start a new AppAttempt after waiting for 10 mins (the expiry interval). This way we'll minimize multiple AMs racing with each other. This can help avoid issues with downstream components like Pig, Hive and Oozie during RM restart.
> In the meantime, new apps will proceed as usual while existing apps wait for recovery.
> This can continue to be useful after work-preserving restart, so that AMs which can properly sync back up with the RM can continue to run and those that don't are guaranteed to be killed before starting a new attempt.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
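
The recovery behavior proposed in the issue description can be sketched roughly as the decision function below. This is only an illustration under stated assumptions, not the actual ResourceManager code: the class RecoveryGate, the Action enum, and the decide method are hypothetical names, and the only value taken from the description is the 10-minute expiry interval.

```java
import java.time.Duration;

/** Hypothetical sketch of the proposed recovery rule; not actual YARN code. */
final class RecoveryGate {

    /** What the RM would do with a recovered application on each check. */
    enum Action { KEEP_WAITING, KILL_OLD_AM, START_NEW_ATTEMPT }

    // Expiry interval quoted in the issue description (10 minutes).
    static final Duration EXPIRY = Duration.ofMinutes(10);

    /**
     * @param oldAmContactedRm   the previous AM has contacted the restarted RM
     * @param oldAmExitConfirmed the previous attempt is known to have exited
     * @param sinceRestart       time elapsed since the RM recovered the app
     */
    static Action decide(boolean oldAmContactedRm,
                         boolean oldAmExitConfirmed,
                         Duration sinceRestart) {
        if (oldAmExitConfirmed) {
            // The previous attempt exited for real: a new one is safe.
            return Action.START_NEW_ATTEMPT;
        }
        if (oldAmContactedRm) {
            // The old AM is still alive: kill it before starting a new AM.
            return Action.KILL_OLD_AM;
        }
        if (sinceRestart.compareTo(EXPIRY) >= 0) {
            // Worst case: nothing heard within the expiry interval.
            return Action.START_NEW_ATTEMPT;
        }
        // Still inside the expiry window; new apps are unaffected by this wait.
        return Action.KEEP_WAITING;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false, Duration.ofMinutes(3)));  // KEEP_WAITING
        System.out.println(decide(true, false, Duration.ofMinutes(3)));   // KILL_OLD_AM
        System.out.println(decide(false, false, Duration.ofMinutes(10))); // START_NEW_ATTEMPT
    }
}
```

In this sketch a new attempt starts only once the old attempt is confirmed gone or the expiry interval has passed, which is what keeps two AMs for the same application from racing each other.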