hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tom White (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-230) Make changes for RM restart phase 1
Date Fri, 23 Nov 2012 15:34:59 GMT

    [ https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13503228#comment-13503228

Tom White commented on YARN-230:

Overall, this looks great. General feedback:

* Can we make application removal atomic? If the RM shuts down after a completed application
is removed from the state store, but before the app attempts are removed from the store, then
the app attempts may be orphaned. (There's a comment about it in FileSystemRMStateStore, but
no action is taken so the attempt files will remain in the store.) It might be better to make
RMStateStore#removeApplicationState responsible for removing the app attempts (i.e. remove
removeApplicationAttemptState). This would solve the orphaning problem, and it would also
make it possible to store the app attempts in a directory nested under the application directory,
which would be nicer from a scaling point of view, and also for someone having to debug the
state on the filesystem.
* If the RM shuts down before a (successful) completed application is removed from the state
store, will it be rerun on restart, or will the fact that a successful app attempt was stored
mean that it doesn't need to? Obviously, the second one would be preferable.
* The exceptions thrown by the public methods of RMStateStore should be more specific than
* Let's have a default for yarn.resourcemanager.store.class in yarn-default.xml. StoreFactory
has MemoryRMStateStore as the default, but that's not useful when running on a cluster; FileSystemRMStateStore
would be better. Similarly it would be good to have the default location for the store be
a system directory on the default file system. With these two changes folks would only need
to set yarn.resourcemanager.recovery.enabled to true to enable recovery. (We might make that
enabled by default at some point too.)
* MemoryRMStateStore#removeApplicationState will fail if asserts are disabled: the remove
method should be called in a separate statement and assigned to a variable which can be checked
in the assert. It's worth checking if this problem exists elsewhere.
* Naming nit: Store was renamed to RMStateStore, but so StoreFactory should be renamed to
* Naming nit: zk.rm-state-store rather than zk.rmstatestore for consistency with other property
names. Also for fs.rmstatestore, and zk.rmstatestore.parentpath (parent-path).

> Make changes for RM restart phase 1
> -----------------------------------
>                 Key: YARN-230
>                 URL: https://issues.apache.org/jira/browse/YARN-230
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: PB-impl.patch, Recovery.patch, Store.patch, Test.patch, YARN-230.1.patch
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to save application
state and read them back after restart. Upon restart, the NM's are asked to reboot and the
previously running AM's are restarted.
> After this is done, RM HA and work preserving restart can continue in parallel. For more
details please refer to the design document in YARN-128

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message