Date: Tue, 25 Sep 2012 07:18:09 +1100 (NCT)
From: "Vinod Kumar Vavilapalli (JIRA)"
To: yarn-issues@hadoop.apache.org
Message-ID: <446534344.118433.1348517889677.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (YARN-128) Resurrect RM Restart

    [ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462077#comment-13462077 ]

Vinod Kumar Vavilapalli commented on YARN-128:
----------------------------------------------

+1 for most of your points. Some specific comments:

bq. What about AM's that completed during restart. Re-running them should be a no-op.
AMs should not finish themselves while the RM is down or recovering. They should just spin.

bq. How to handle container releases messages that were lost when RM was down? Will AM's get delivery failure and continue to resend indefinitely?
You mean release requests from AM? Like above, if AMs just spin, we don't have an issue.

bq. Need new AM-RM API to resend asks from AM to RM.
See AMResponse.getReboot(). That can be used to inform AMs to resend all details.

bq. What information about keys and tokens to persist across restart so that existing secure containers continue to run with new RM and new containers.
We already noted this as Java comments in code. Need to put in proper documentation.

bq. ZK nodes themselves should be secure.
Good point. Worst case, if ZK doesn't support security, we can rely on an RM-specific ZK instance and firewall rules.

More requirements:
 - An upper bound (time) on recovery?
 - Writing to ZK shouldn't add more than x% (< 1-2%) to app latency?

More state to save:
 - New app submissions should be persisted/accepted but not acted upon during recovery.

Miscellaneous points:
 - I think we should add a new ServiceState called Recovering and use the same in RM.
 - Overall, clients, AMs and NMs should spin while the RM is down or doing recovery. We also need to handle fail-over of the RM, but that should be done as part of a separate ticket.
 - When is recovery officially finished? When all running AMs sync up? I suppose so; that would give an upper bound equal to the AM-expiry interval.
 - Need to think about how the RM-NM shared-secret roll-over is affected if the RM is down for a significant amount of time.

> Resurrect RM Restart
> ---------------------
>
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
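
To make the "AMs just spin" suggestion in the comment concrete, here is a minimal Java sketch of an AM-side heartbeat loop. It is an illustration under stated assumptions, not YARN code: `HeartbeatReply`, `RmEndpoint` and `SpinningAm` are hypothetical names, and the `resync` flag only plays the role that the comment proposes for the reboot signal in the AM response. The sketch also assumes asks are plain strings and are never removed once satisfied.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical heartbeat reply; 'resync' stands in for the RM's reboot signal. */
final class HeartbeatReply {
    final boolean resync;
    HeartbeatReply(boolean resync) { this.resync = resync; }
}

/** Hypothetical RM endpoint; throws IOException while the RM is unreachable. */
interface RmEndpoint {
    HeartbeatReply heartbeat(List<String> asks) throws IOException;
}

/** AM-side loop that spins while the RM is down and resends all asks on resync. */
final class SpinningAm {
    private final RmEndpoint rm;
    private final List<String> allAsks = new ArrayList<>();     // every ask ever made
    private final List<String> pendingAsks = new ArrayList<>(); // asks not yet accepted

    SpinningAm(RmEndpoint rm) { this.rm = rm; }

    void addAsk(String ask) {
        allAsks.add(ask);
        pendingAsks.add(ask);
    }

    /** One heartbeat attempt; returns true only if the RM accepted the pending asks. */
    boolean heartbeatOnce() {
        try {
            HeartbeatReply reply = rm.heartbeat(new ArrayList<>(pendingAsks));
            if (reply.resync) {
                // The RM restarted and lost our state: queue everything for resend.
                pendingAsks.clear();
                pendingAsks.addAll(allAsks);
                return false;
            }
            pendingAsks.clear();
            return true;
        } catch (IOException e) {
            // RM is down or recovering: keep local state and let the caller retry.
            return false;
        }
    }

    /** Spins (with a fixed sleep) until a heartbeat is accepted or attempts run out. */
    boolean heartbeatUntilAccepted(int maxAttempts, long sleepMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (heartbeatOnce()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }
}
```

The point of the design is that the AM never exits or drops state on a failed heartbeat, so lost release or ask messages are covered by the full resend after the RM comes back. The bounded `maxAttempts` is only there to keep the sketch testable; the comment itself does not specify a retry limit.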