hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-128) Resurrect RM Restart
Date Mon, 24 Sep 2012 20:18:09 GMT

    [ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13462077#comment-13462077 ]

Vinod Kumar Vavilapalli commented on YARN-128:

+1 for most of your points. Some specific comments:

bq. What about AMs that completed during restart? Re-running them should be a no-op.
AMs should not finish themselves while the RM is down or recovering. They should just spin.

bq. How to handle container release messages that were lost when RM was down? Will AMs get
delivery failure and continue to resend indefinitely?
You mean release requests from the AM? Like above, if AMs just spin, we don't have an issue.

bq. Need new AM-RM API to resend asks from AM to RM.
See AMResponse.getReboot(). That can be used to inform AMs to resend all details.
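
A minimal sketch of how an AM could react to such a reboot signal, assuming it tracks its outstanding asks locally. HeartbeatReply and AskTracker are hypothetical names for illustration, not the real YARN classes; only the idea of a reboot flag comes from the comment above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the RM's heartbeat reply; not the real AMResponse. */
final class HeartbeatReply {
    final boolean reboot;
    HeartbeatReply(boolean reboot) { this.reboot = reboot; }
}

/** Hypothetical AM-side bookkeeping of outstanding resource asks. */
final class AskTracker {
    private final List<String> allAsks = new ArrayList<>();    // every ask still outstanding
    private final List<String> unsentAsks = new ArrayList<>(); // asks not yet sent to the RM

    void addAsk(String ask) {
        allAsks.add(ask);
        unsentAsks.add(ask);
    }

    /** Returns what to send on the next heartbeat, given the previous reply (may be null). */
    List<String> nextPayload(HeartbeatReply previous) {
        if (previous != null && previous.reboot) {
            // The restarted RM has lost its view of this AM: resend everything.
            unsentAsks.clear();
            return new ArrayList<>(allAsks);
        }
        // Normal case: send only the asks added since the last heartbeat.
        List<String> delta = new ArrayList<>(unsentAsks);
        unsentAsks.clear();
        return delta;
    }
}
```

With this shape the AM needs no separate resend API: a normal heartbeat carries the delta, and a heartbeat after a reboot signal carries the full set.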

bq. What information about keys and tokens to persist across restart so that existing secure
containers continue to run with new RM and new containers.
We already noted this as Java comments in the code. We need to put it in proper documentation.

bq. ZK nodes themselves should be secure.
Good point. In the worst case, if ZK doesn't support security, we can rely on an RM-specific ZK instance
and firewall rules.

More requirements:
 - An upper bound (time) on recovery?
 - Writing to ZK shouldn't add more than x% (< 1-2%) to app latency?

More state to save:
 - New app submissions should be persisted/accepted but not acted upon during recovery.
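
A rough sketch of "persisted/accepted but not acted upon", assuming an in-memory list stands in for the real state store (ZK or otherwise). RecoveringSubmitter and its methods are hypothetical names, not existing RM code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical sketch: accept and persist submissions, but defer acting on them. */
final class RecoveringSubmitter {
    private final List<String> persisted = new ArrayList<>(); // stand-in for the state store
    private final Queue<String> deferred = new ArrayDeque<>(); // accepted during recovery
    private final List<String> scheduled = new ArrayList<>();  // handed to the scheduler
    private boolean recovering = true;

    synchronized void submit(String appId) {
        persisted.add(appId);            // always persist first, so a crash loses nothing
        if (recovering) {
            deferred.add(appId);         // accepted, but not acted upon yet
        } else {
            scheduled.add(appId);
        }
    }

    /** Called once recovery is declared finished; releases deferred apps in arrival order. */
    synchronized void finishRecovery() {
        recovering = false;
        while (!deferred.isEmpty()) {
            scheduled.add(deferred.remove());
        }
    }

    synchronized List<String> persistedApps() { return new ArrayList<>(persisted); }
    synchronized List<String> scheduledApps() { return new ArrayList<>(scheduled); }
}
```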

Miscellaneous points:
 - I think we should add a new ServiceState called Recovering and use the same in RM.
 - Overall, clients, AMs and NMs should spin while the RM is down or doing recovery. Also,
we need to handle fail-over of the RM; that should be done as part of a separate ticket.
 - When is recovery officially finished? When all running AMs sync up? I suppose so; that
would give an upper bound equal to the AM-expiry interval.
 - Need to think of how the RM-NM shared secret roll-over is affected if the RM is down for a
significant amount of time.
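
To make the "spin while the RM is down or recovering" idea concrete, here is a small sketch of a bounded wait loop. RmState, SpinUntilReady and the probe are hypothetical; the timeout plays the role of an upper bound such as the AM-expiry interval mentioned above.

```java
import java.util.function.Supplier;

/** Hypothetical view of the RM as seen by a client, AM or NM. */
enum RmState { DOWN, RECOVERING, STARTED }

final class SpinUntilReady {
    /**
     * Polls the probe until the RM reports STARTED or the timeout elapses.
     * Returns true if the RM became ready in time, false on timeout or interrupt.
     */
    static boolean await(Supplier<RmState> probe, long timeoutMs, long pollMs) {
        final long start = System.nanoTime();
        final long timeoutNanos = timeoutMs * 1_000_000L;
        while (true) {
            if (probe.get() == RmState.STARTED) {
                return true;                        // RM is back; resume normal work
            }
            if (System.nanoTime() - start >= timeoutNanos) {
                return false;                       // upper bound on recovery exceeded
            }
            try {
                Thread.sleep(pollMs);               // spin politely instead of busy-waiting
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
    }
}
```

Both DOWN and RECOVERING are treated the same way by the caller, which is the point of spinning: the caller does not finish or fail, it just waits.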

> Resurrect RM Restart 
> ---------------------
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt
> We should resurrect 'RM Restart' which we disabled sometime during the RM refactor.

