hadoop-yarn-issues mailing list archives

From "Karthik Kambatla (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2001) Threshold for RM to accept requests from AM after failover
Date Tue, 06 May 2014 21:10:51 GMT

    [ https://issues.apache.org/jira/browse/YARN-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13991138#comment-13991138 ]

Karthik Kambatla commented on YARN-2001:

bq. Take a simple case: an application is granted 50% of the cluster's resources, and the cluster
has 2 nodes. The application has used up its whole resource quota and launched all its containers on
node1. The RM fails over, and node2 is the first to re-sync with the RM.
Only the RM went down, not the AM. The AM still knows that it is running all its
containers on node1, and places requests only for additional resources. No?

bq. Another example: the RM needs to generate new container IDs for the new containers
requested by the AM. If the RM accepts new requests from the AM before the nodes sync back, the new
container IDs may overlap with the IDs of the recovered containers.
I am not sure if [~adhoot]'s prototype includes this, but I think we should also use the RM's "cluster"
timestamp in the names of container IDs, so we know precisely which RM authorized creating
a particular container.
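
To illustrate the timestamp idea, here is a minimal sketch. It is not YARN's actual ContainerId class: the names ContainerIdSketch, IdAllocator, and newContainerId, the ID layout, and the timestamp values are all hypothetical. It only shows that if each RM incarnation stamps its own start time into the IDs it hands out, the per-RM sequence counter can restart after failover without colliding with IDs issued by the previous RM.

```java
import java.util.concurrent.atomic.AtomicLong;

public class ContainerIdSketch {

    /** Hypothetical ID allocator owned by one RM incarnation (assumption, not YARN code). */
    static final class IdAllocator {
        private final long rmStartTimestamp;              // the RM's "cluster" timestamp
        private final AtomicLong nextSequence = new AtomicLong(1);

        IdAllocator(long rmStartTimestamp) {
            this.rmStartTimestamp = rmStartTimestamp;
        }

        /** Builds an ID that records which RM incarnation authorized the container. */
        String newContainerId(String appAttempt) {
            return "container_" + rmStartTimestamp + "_" + appAttempt
                    + "_" + nextSequence.getAndIncrement();
        }
    }

    public static void main(String[] args) {
        // Two RM incarnations with made-up start times; both counters start at 1.
        IdAllocator beforeFailover = new IdAllocator(1399400000000L);
        IdAllocator afterFailover = new IdAllocator(1399410000000L);

        String oldId = beforeFailover.newContainerId("app7_01");
        String newId = afterFailover.newContainerId("app7_01");

        System.out.println(oldId);
        System.out.println(newId);
        // Same sequence number, different timestamp, so the IDs differ.
        System.out.println("collision: " + oldId.equals(newId));
    }
}
```

With this layout the new RM would not need to learn the highest sequence number used by recovered containers before issuing IDs, although it says nothing about whether scheduling before nodes re-sync is otherwise safe.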

> Threshold for RM to accept requests from AM after failover
> ----------------------------------------------------------
>                 Key: YARN-2001
>                 URL: https://issues.apache.org/jira/browse/YARN-2001
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Jian He
> After failover, the RM may need a threshold to determine whether it is safe to
> make scheduling decisions and start accepting new container requests from AMs. The threshold
> could be a certain number of nodes, i.e., the RM waits until that many nodes have rejoined
> before accepting new container requests. Or it could simply be a timeout: the RM accepts
> new requests only after the timeout expires.
> NMs joined after the threshold can be treated as new NMs and instructed to kill all its
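
The threshold described in the issue summary above can be sketched as a small gate. This is only an illustration under stated assumptions: FailoverGateSketch, RequestGate, nodeResynced, and acceptsNewRequests are hypothetical names, not ResourceManager APIs. The summary presents the node count and the timeout as alternatives; the sketch opens the gate when either condition is met, which is a design choice made here, not something the issue specifies. The current time is passed in explicitly so the behavior is easy to check.

```java
import java.util.HashSet;
import java.util.Set;

public class FailoverGateSketch {

    /** Hypothetical gate deciding when a restarted RM accepts new container requests. */
    static final class RequestGate {
        private final int minNodes;               // node-count threshold
        private final long timeoutMillis;         // timeout threshold
        private final long failoverTimeMillis;    // when this RM became active
        private final Set<String> resyncedNodes = new HashSet<>();

        RequestGate(int minNodes, long timeoutMillis, long failoverTimeMillis) {
            this.minNodes = minNodes;
            this.timeoutMillis = timeoutMillis;
            this.failoverTimeMillis = failoverTimeMillis;
        }

        /** Records that a node has re-synced; repeated reports count once. */
        synchronized void nodeResynced(String nodeId) {
            resyncedNodes.add(nodeId);
        }

        /** True once enough nodes have rejoined or the timeout has elapsed. */
        synchronized boolean acceptsNewRequests(long nowMillis) {
            boolean enoughNodes = resyncedNodes.size() >= minNodes;
            boolean timedOut = nowMillis - failoverTimeMillis >= timeoutMillis;
            return enoughNodes || timedOut;
        }
    }

    public static void main(String[] args) {
        // Two-node cluster, 10 s timeout, failover at t = 0 (all values made up).
        RequestGate gate = new RequestGate(2, 10_000L, 0L);

        System.out.println(gate.acceptsNewRequests(1_000L)); // false: no nodes yet
        gate.nodeResynced("node2");
        System.out.println(gate.acceptsNewRequests(2_000L)); // false: 1 of 2 nodes
        gate.nodeResynced("node1");
        System.out.println(gate.acceptsNewRequests(3_000L)); // true: threshold met
    }
}
```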

This message was sent by Atlassian JIRA
