hadoop-yarn-issues mailing list archives

From "Anubhav Dhoot (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-556) RM Restart phase 2 - Work preserving restart
Date Sat, 19 Apr 2014 01:11:21 GMT

     [ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Anubhav Dhoot updated YARN-556:

    Attachment: WorkPreservingRestartPrototype.001.patch

This prototype is a way to understand the overall design, the major issues that need to
be addressed, and the minor details that crop up.
It is not a substitute for actual code/unit tests for each sub-task.
Hopefully it will help frame a discussion on the overall approach and on each sub-task.
In this prototype, the following changes are demonstrated. 

	1.	Containers that were running when the RM restarted continue running.
	2.	On resync, the NM sends the list of running containers as a ContainerReport, so container
capability (sizes) is available.
	3.	On resync, the AM re-registers instead of shutting down. The AM can make further requests
after the RM restart and they are accepted.
	4.	Sample scheduler changes in FairScheduler. It re-registers the application attempt on
recovery. On NM addNode it adds the containers to that application attempt and charges them
correctly to the attempt for usage tracking.
	5.	Applications and containers resume their lifecycle, with additional transitions to support
continuation after recovery.
	6.	A cluster timestamp is added to the containerId so that containerIds created after the RM
restart do not clash with containerIds from before (the containerId counter resets to zero in memory).
	7.	The changes are controlled by a flag.

Topics not addressed
	1.	Key and token changes.
	2.	The AM does not yet resend requests made before the restart. So if the RM restarts after
the AM has made its request but before the RM returns a container, the AM is left waiting for
an allocation. Only new asks made after the RM restart work.
	3.	Completed-container status, as per the design, is not handled yet.
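One possible direction for gap 2 above (an assumption on my part, not part of the patch) is for the AM to remember its outstanding asks and replay them on resync. A minimal sketch with hypothetical names ("Ask", "replayOutstanding" are illustrative, not YARN APIs):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of AM-side request replay after an RM restart.
// The restarted RM loses all pending requests with its in-memory state,
// so the AM must re-send anything not yet satisfied.
public class OutstandingAsks {
    // A resource request the AM has sent but not yet had satisfied.
    public static class Ask {
        final String resource;
        boolean allocated;
        Ask(String resource) { this.resource = resource; }
    }

    private final List<Ask> asks = new ArrayList<>();

    public Ask send(String resource) {
        Ask a = new Ask(resource);
        asks.add(a);        // remember every ask until it is satisfied
        return a;
    }

    public void markAllocated(Ask a) {
        a.allocated = true; // satisfied asks need no replay
    }

    // On resync, return the asks that must be re-sent to the restarted RM.
    public List<String> replayOutstanding() {
        List<String> pending = new ArrayList<>();
        for (Ask a : asks) {
            if (!a.allocated) pending.add(a.resource);
        }
        return pending;
    }
}
```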

Readme for running through the prototype

a) Set up with RM recovery turned on and the scheduler set to FairScheduler.
b) Start a sleep job with a map and a reduce, such as:
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.0.0-SNAPSHOT.jar
sleep -mt 12000 -rt 12000
c) Restart the RM (yarn-daemon.sh stop/start resourcemanager) and see that containers are not killed.
The following 2 scenarios work:
  1. Restart the RM while the reduce is running. The reduce continues and the application then
completes successfully. Demonstrates continuation of running containers without restarting them.
  2. Restart the RM while the map is running. The map continues, then the reduce executes, and
the application completes successfully. Demonstrates that requesting more resources after
restart works, in addition to the previous scenario.
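Step (a) above roughly corresponds to the following yarn-site.xml settings. The property names are as I recall them from Hadoop trunk around this time; verify them against the build you are testing:

```xml
<!-- yarn-site.xml: enable RM recovery and use the FairScheduler.
     Property names assumed from Hadoop trunk of this era; verify locally. -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
```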

> RM Restart phase 2 - Work preserving restart
> --------------------------------------------
>                 Key: YARN-556
>                 URL: https://issues.apache.org/jira/browse/YARN-556
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch
> YARN-128 covered storing the state needed for the RM to recover critical information.
This umbrella jira will track changes needed to recover the running state of the cluster so
that work can be preserved across RM restarts.

