hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-230) Make changes for RM restart phase 1
Date Tue, 20 Nov 2012 18:36:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13501361#comment-13501361 ]

Bikas Saha commented on YARN-230:

Attaching 4 patches that break up the whole change for easier reviewing. They won't build on
their own.
1) PB-impl.patch - Classes and PB impl for objects used to store Application and ApplicationAttempt
data. Pretty straightforward code derived from similar code for other PB impls.
2) Store.patch - Implementation of RMStateStore abstract class that interfaces between RM
and real store classes. It translates RM objects like RMAppImpl into PB objects like ApplicationStateData.
It also provides a common implementation of blocking and non-blocking store functions. Non-blocking
operations are performed using an AsyncDispatcher and RMStateStore events that eventually
call implementations of abstract methods which will be provided by real stores. A memory store
is implemented for testing.
3) Test.patch - One new test, TestRMRestart, is a functional test that takes the RM through
storing and recovering state with applications in different states of execution. The flow
should be easy to follow with the comments. TestRMAppAttemptTransitions covers the
RMAppAttemptImpl state machine changes. Other changes are refactoring and the addition of
helper methods to MockRM etc.
4) Recovery.patch - Implements the proposal in the design doc. On startup, the RM loads the old
state and recovers from it if recovery is enabled. Each recovered app is submitted to the
RMAppManager so that it re-hydrates all references and tokens as it normally would and is then
ready to start its next attempt. Each recovered application attempt is added to the app's
attempts collection and moved to a RECOVERED state. After that, the services, including the
AsyncDispatcher, are started, which triggers creation of new attempts for the submitted apps.
Existing code in ApplicationMasterService and ResourceTrackerService reboots NMs and previously
running AMs. When a new app is submitted, its data is saved (blocking) before replying to the
client. Just before an attempt is launched, its data is saved (non-blocking). To support the
non-blocking store, new RMAppAttempt states have been added for regular and unmanaged attempts.
Once an app is finished, its data is removed.
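
For reviewers who want a mental model of the blocking vs. non-blocking store paths described
in 2) and 4), here is a minimal sketch. It is not the code in the patches: the class names
SketchStateStore and SketchMemoryStore, the String-based state, and the use of a
single-thread executor in place of the AsyncDispatcher are all assumptions made for
illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical stand-in for the abstract store layer; not the real RMStateStore.
abstract class SketchStateStore {
    // Assumption: one worker thread plays the role of the AsyncDispatcher, so
    // queued store requests run in order, off the caller's thread.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();

    // Methods a concrete ("real") store would implement.
    protected abstract void writeState(String id, String data) throws Exception;
    protected abstract void deleteState(String id) throws Exception;
    public abstract Map<String, String> loadState() throws Exception;

    // Blocking path: returns only after the data has been written,
    // e.g. before replying to a client that submitted a new app.
    public void storeBlocking(String id, String data) throws Exception {
        writeState(id, data);
    }

    // Non-blocking path: returns immediately; onDone later receives null on
    // success or the exception on failure, e.g. to drive a state transition.
    public void storeAsync(String id, String data, Consumer<Exception> onDone) {
        dispatcher.execute(() -> {
            Exception failure = null;
            try {
                writeState(id, data);
            } catch (Exception e) {
                failure = e;
            }
            onDone.accept(failure);
        });
    }

    // Called once an app is finished and its saved data is no longer needed.
    public void remove(String id) throws Exception {
        deleteState(id);
    }

    // Stops accepting work and waits for queued stores to finish.
    public boolean shutdown(long timeoutMillis) throws InterruptedException {
        dispatcher.shutdown();
        return dispatcher.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}

// In-memory implementation, useful only for tests: nothing survives a restart.
class SketchMemoryStore extends SketchStateStore {
    private final Map<String, String> state = new ConcurrentHashMap<>();

    @Override
    protected void writeState(String id, String data) {
        state.put(id, data);
    }

    @Override
    protected void deleteState(String id) {
        state.remove(id);
    }

    @Override
    public Map<String, String> loadState() {
        // Return a copy so callers cannot mutate the store's internal map.
        return new HashMap<>(state);
    }
}
```

In this sketch a caller would use storeBlocking when it must not proceed until the data is
saved, and storeAsync with a callback when it can wait in an intermediate state until the
store completes.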
> Make changes for RM restart phase 1
> -----------------------------------
>                 Key: YARN-230
>                 URL: https://issues.apache.org/jira/browse/YARN-230
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: YARN-230.1.patch
> As described in YARN-128, phase 1 of RM restart puts in place mechanisms to save application
> state and read it back after restart. Upon restart, the NMs are asked to reboot and the
> previously running AMs are restarted.
> After this is done, RM HA and work-preserving restart can continue in parallel. For more
> details, please refer to the design document in YARN-128.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
