Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Tue, 25 Sep 2012 07:18:09 +1100 (NCT)
From: "Vinod Kumar Vavilapalli (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <446534344.118433.1348517889677.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (YARN-128) Resurrect RM Restart
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462077#comment-13462077 ] 

Vinod Kumar Vavilapalli commented on YARN-128:
----------------------------------------------

+1 for most of your points. Some specific comments:

bq. What about AMs that completed during restart? Re-running them should be a no-op.
AMs should not finish themselves while the RM is down or recovering. They should just spin.

bq. How to handle container release messages that were lost when RM was down? Will AMs get delivery failure and continue to resend indefinitely?
You mean release requests from the AM? As above, if AMs just spin, we don't have an issue.
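
To make "just spin" concrete, here is a minimal sketch of an AM-side retry loop. RmCall and SpinUntilAvailable are hypothetical names for illustration, not existing YARN classes, and treating every IOException as "RM down or recovering" is an assumption.

```java
import java.io.IOException;

// Hypothetical functional interface standing in for any AM -> RM call.
interface RmCall<T> {
    T invoke() throws IOException;
}

// Hypothetical helper, not part of YARN: keeps retrying a call until the RM answers.
final class SpinUntilAvailable {
    private SpinUntilAvailable() {}

    /** Invokes the call repeatedly, sleeping retryIntervalMs after each I/O failure. */
    static <T> T call(RmCall<T> call, long retryIntervalMs) {
        while (true) {
            try {
                return call.invoke();
            } catch (IOException e) {
                // Assumption: an I/O failure means the RM is down or recovering,
                // so the AM neither finishes nor gives up; it waits and retries.
                try {
                    Thread.sleep(retryIntervalMs);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt and stop spinning.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for RM", ie);
                }
            }
        }
    }
}
```

With something like this wrapped around release requests, a request that fails while the RM is down is simply sent again once the RM is back, so nothing is lost.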

bq. Need new AM-RM API to resend asks from AM to RM.
See AMResponse.getReboot(). That flag can be used to tell AMs to resend all of their details.
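
A rough sketch of how an AM could react to such a flag by resending everything it still needs. HeartbeatResponse and AskTracker are hypothetical stand-ins for illustration only (asks are plain strings here); this is not the actual AM-RM protocol code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the per-heartbeat response from the RM.
final class HeartbeatResponse {
    final boolean reboot;

    HeartbeatResponse(boolean reboot) {
        this.reboot = reboot;
    }
}

// Hypothetical AM-side bookkeeping of asks that have not been satisfied yet.
final class AskTracker {
    private final List<String> outstanding = new ArrayList<>(); // every unsatisfied ask
    private final List<String> pending = new ArrayList<>();     // asks to send on the next heartbeat

    /** Records a new ask and queues it for the next heartbeat. */
    void addAsk(String ask) {
        outstanding.add(ask);
        pending.add(ask);
    }

    /** Forgets an ask once the RM has satisfied it. */
    void satisfied(String ask) {
        outstanding.remove(ask);
        pending.remove(ask);
    }

    /** Returns the asks for the next heartbeat and clears the pending queue. */
    List<String> nextRequest() {
        List<String> toSend = new ArrayList<>(pending);
        pending.clear();
        return toSend;
    }

    /** On a reboot signal, queues every outstanding ask again, since the RM lost them. */
    void onResponse(HeartbeatResponse response) {
        if (response.reboot) {
            pending.clear();
            pending.addAll(outstanding);
        }
    }
}
```

The point of the sketch is that the AM keeps the full set of outstanding asks locally, so a single boolean from the RM is enough to trigger a complete resend.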

bq. What information about keys and tokens to persist across restart so that existing secure containers continue to run with new RM and new containers.
We already noted this as Java comments in the code. It needs to go into proper documentation.

bq. ZK nodes themselves should be secure.
Good point. In the worst case, if ZK doesn't support security, we can rely on an RM-specific ZK instance and firewall rules.

More requirements:
 - An upper bound (time) on recovery?
 - Writing to ZK shouldn't add more than x% (< 1-2%) to app latency?

More state to save:
 - New app submissions should be persisted/accepted but not acted upon during recovery.

Miscellaneous points:
 - I think we should add a new ServiceState called Recovering and use it in the RM.
 - Overall, clients, AMs and NMs should spin while the RM is down or doing recovery. We also need to handle RM fail-over, but that should be done as part of a separate ticket.
 - When is recovery officially finished? When all running AMs sync up? I suppose so; that would give an upper bound equal to the AM-expiry interval.
 - Need to think about how the RM-NM shared secret roll-over is affected if the RM is down for a significant amount of time.
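
To illustrate the Recovering state and the "finished when all running AMs sync up" rule, here is a small sketch. RmState and RecoveryTracker are hypothetical names, the enum values are not the existing ServiceState values, and treating AMs that miss the deadline as expired is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical lifecycle states; RECOVERING is the proposed addition.
enum RmState { INITED, RECOVERING, STARTED, STOPPED }

// Hypothetical tracker that decides when recovery is over.
final class RecoveryTracker {
    private final Set<String> unsynced; // AMs running before the restart that have not re-synced
    private final long deadlineMs;      // recovery start + AM-expiry interval
    private RmState state = RmState.RECOVERING;

    RecoveryTracker(Set<String> runningAms, long startMs, long amExpiryIntervalMs) {
        this.unsynced = new HashSet<>(runningAms);
        this.deadlineMs = startMs + amExpiryIntervalMs;
        update(startMs); // nothing to wait for if no AMs were running
    }

    /** Called when an AM has re-synced with the restarted RM. */
    void amSynced(String amId, long nowMs) {
        unsynced.remove(amId);
        update(nowMs);
    }

    /** Called periodically so that the deadline is enforced even if no AM reports. */
    void tick(long nowMs) {
        update(nowMs);
    }

    RmState state() {
        return state;
    }

    private void update(long nowMs) {
        if (state == RmState.RECOVERING && (unsynced.isEmpty() || nowMs >= deadlineMs)) {
            unsynced.clear(); // assumption: AMs that missed the deadline count as expired
            state = RmState.STARTED;
        }
    }
}
```

Recovery ends as soon as the last running AM syncs, and never later than one AM-expiry interval after it began, which is the upper bound mentioned above.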

                
> Resurrect RM Restart 
> ---------------------
>
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM refactor.
