hadoop-yarn-issues mailing list archives

From "Anubhav Dhoot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-556) RM Restart phase 2 - Work preserving restart
Date Mon, 12 May 2014 17:56:15 GMT

    [ https://issues.apache.org/jira/browse/YARN-556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995328#comment-13995328 ]

Anubhav Dhoot commented on YARN-556:

bq. clustertimestamp is added to containerId so that containerId after RM restart do not clash
with containerId before (as the containerId counter resets to zero in memory). 

The problem is that the containerId is currently composed of ApplicationAttemptId + int. The int
part comes from an in-memory containerIdCounter in AppSchedulingInfo, which is reset after
an RM restart. Without any changes, the containerIds for containers allocated after the restart
would clash with existing containerIds.
The prototype proposal is to make it ApplicationAttemptId + uniqueid + int, where the uniqueid
can be a timestamp set by the RM. I feel containerId should be an opaque string that YARN app
developers don't take a dependency on. Also, if we used protobuf serialization/deserialization
rules everywhere, we could deal with compatibility changes across different YARN code versions.
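
To make the clash concrete, here is a minimal standalone sketch. It is not the prototype patch and does not use any YARN classes: the class ContainerIdSketch, the Allocator type, the "_" separator, and the two timestamp values are all hypothetical, chosen only to mimic an in-memory counter that restarts from zero and an RM-set uniqueid placed between the attempt id and the int.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ContainerIdSketch {

    /** Hypothetical stand-in for the per-attempt, in-memory id counter. */
    static final class Allocator {
        private final String appAttemptId;
        private final long rmUniqueId; // assumption: a timestamp set by the RM at startup
        private final AtomicInteger counter = new AtomicInteger(0); // lost on restart

        Allocator(String appAttemptId, long rmUniqueId) {
            this.appAttemptId = appAttemptId;
            this.rmUniqueId = rmUniqueId;
        }

        /** Current scheme: ApplicationAttemptId + int. */
        String nextCurrentId() {
            return appAttemptId + "_" + counter.incrementAndGet();
        }

        /** Proposed scheme: ApplicationAttemptId + uniqueid + int. */
        String nextProposedId() {
            return appAttemptId + "_" + rmUniqueId + "_" + counter.incrementAndGet();
        }
    }

    /**
     * Allocates one id before and one id after a simulated RM restart for the
     * same application attempt, and reports whether the two ids are equal.
     */
    static boolean clashesAfterRestart(boolean useUniqueId) {
        String attempt = "appattempt_1399900000000_0001_000001"; // illustrative value

        // Before the restart: counter starts at 0, RM started at some time T1.
        Allocator before = new Allocator(attempt, 1399900000000L);
        String first = useUniqueId ? before.nextProposedId() : before.nextCurrentId();

        // After the restart: a fresh in-memory counter, RM started at a later time T2.
        Allocator after = new Allocator(attempt, 1399916175000L);
        String second = useUniqueId ? after.nextProposedId() : after.nextCurrentId();

        return first.equals(second);
    }

    public static void main(String[] args) {
        System.out.println("current scheme clashes:  " + clashesAfterRestart(false));
        System.out.println("proposed scheme clashes: " + clashesAfterRestart(true));
    }
}
```

In this sketch the two ids only differ if the RM picks a different uniqueid on each start, which is the property the timestamp is meant to provide. Code that parses the id string would see a different shape under the second scheme, which is one reason to treat containerId as opaque.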

> RM Restart phase 2 - Work preserving restart
> --------------------------------------------
>                 Key: YARN-556
>                 URL: https://issues.apache.org/jira/browse/YARN-556
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: Work Preserving RM Restart.pdf, WorkPreservingRestartPrototype.001.patch
> YARN-128 covered storing the state needed for the RM to recover critical information.
This umbrella jira will track changes needed to recover the running state of the cluster so
that work can be preserved across RM restarts.
