hadoop-yarn-issues mailing list archives

From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-128) Resurrect RM Restart
Date Fri, 16 Nov 2012 14:04:16 GMT

    [ https://issues.apache.org/jira/browse/YARN-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498825#comment-13498825 ]

Tom White commented on YARN-128:
--------------------------------

Bikas, this looks good so far. Thanks for working on it. A few comments:

* Is there a race condition in ResourceManager#recover where RMAppImpl#recover is called after
the StartAppAttemptTransition triggered by resubmitting the app? If so, the new attempt would
get in first, and the earlier app attempts (from before the restart) would no longer be the
first ones.
* I think we need the concept of a 'killed' app attempt (when the system is at fault, not
the app) as well as a 'failed' attempt, like we have in MR task attempts. Without the distinction,
a restart will count against the user's app attempts (default 1 retry), which is undesirable.
* Rather than change the ResourceManager constructor, you could read the recoveryEnabled flag
from the configuration.
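
To illustrate the 'killed' versus 'failed' point above, here is a minimal sketch of how attempt accounting might exclude system-caused terminations. Every name in it (AttemptAccounting, AttemptOutcome, canStartNewAttempt, and the limit of 1) is hypothetical and chosen for illustration; none of it is taken from the YARN-128 patches or from the actual RMAppImpl code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: these names are illustrative, not from the YARN-128 patch.
class AttemptAccounting {
    /** How an application attempt ended. */
    enum AttemptOutcome {
        FAILED, // the application itself was at fault
        KILLED  // the system was at fault, e.g. the RM restarted
    }

    /** Counts only the attempts that the application caused to fail. */
    static long countedFailures(List<AttemptOutcome> outcomes) {
        return outcomes.stream()
                .filter(outcome -> outcome == AttemptOutcome.FAILED)
                .count();
    }

    /** A new attempt is allowed while counted failures stay below the limit. */
    static boolean canStartNewAttempt(List<AttemptOutcome> outcomes, int maxFailedAttempts) {
        return countedFailures(outcomes) < maxFailedAttempts;
    }

    public static void main(String[] args) {
        int maxFailedAttempts = 1; // assumed limit, for illustration only

        // An attempt lost to an RM restart does not use up the user's budget.
        System.out.println(canStartNewAttempt(
                Arrays.asList(AttemptOutcome.KILLED), maxFailedAttempts)); // prints true

        // A genuine application failure does.
        System.out.println(canStartNewAttempt(
                Arrays.asList(AttemptOutcome.FAILED), maxFailedAttempts)); // prints false
    }
}
```

If both outcomes were folded into a single 'failed' state, the first case would also print false, which is the undesirable behaviour described above.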
                
> Resurrect RM Restart 
> ---------------------
>
>                 Key: YARN-128
>                 URL: https://issues.apache.org/jira/browse/YARN-128
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.0.0-alpha
>            Reporter: Arun C Murthy
>            Assignee: Bikas Saha
>         Attachments: MR-4343.1.patch, RM-recovery-initial-thoughts.txt, RMRestartPhase1.pdf,
>                      YARN-128.full-code.3.patch, YARN-128.full-code-4.patch, YARN-128.new-code-added.3.patch,
>                      YARN-128.new-code-added-4.patch, YARN-128.old-code-removed.3.patch,
>                      YARN-128.old-code-removed.4.patch, YARN-128.patch
>
>
> We should resurrect 'RM Restart' which we disabled sometime during the RM refactor.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
