hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4679) When work-preserving restart is enabled, the scheduler should wait for the earlier of recovery completion and configured wait time
Date Mon, 08 Feb 2016 18:49:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137460#comment-15137460 ]

Jason Lowe commented on YARN-4679:
----------------------------------

I believe the delay was put in place to avoid the situation where a nodemanager rejoins the
cluster relatively late in the recovery process.  If the scheduler starts allocating based
on stale container state and later discovers a number of other containers already running
on a node, it can violate things like absolute max capacities on queues, maximum user limits,
etc.  Since the resourcemanager currently tracks neither nodes nor containers in the state
store, it doesn't really know when the recovery process is truly complete.  That's why
I think the delay was originally put in place: as a workaround for not knowing directly when
all previous nodes have reported in.
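
To make the failure mode concrete, here is a minimal arithmetic sketch. It is not YARN code:
the class, the method names, and the numbers are all hypothetical, and resources are reduced
to a single integer. It only shows how allocating against partially reported usage can push
a queue past its maximum once a late node reports in.

```java
public class StaleStateExample {
    /**
     * Headroom the scheduler believes a queue has, based only on the usage
     * reported so far. Hypothetical helper, not a YARN API.
     */
    public static int headroom(int queueMaxCapacity, int reportedUsage) {
        return Math.max(0, queueMaxCapacity - reportedUsage);
    }

    /** Real queue usage once the late node's containers are counted as well. */
    public static int totalUsage(int reportedUsage, int newlyAllocated, int lateNodeUsage) {
        return reportedUsage + newlyAllocated + lateNodeUsage;
    }

    public static void main(String[] args) {
        int queueMax = 100;       // assumed absolute max capacity of the queue
        int reportedSoFar = 60;   // usage from nodes that have already re-registered
        int lateNodeUsage = 30;   // containers on a node that has not reported yet

        // The scheduler hands out everything it thinks is free.
        int allocated = headroom(queueMax, reportedSoFar);

        // The late node then reports its still-running containers.
        int actual = totalUsage(reportedSoFar, allocated, lateNodeUsage);

        System.out.println("allocated=" + allocated + " actual=" + actual
                + " overLimit=" + (actual > queueMax));
    }
}
```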

> When work-preserving restart is enabled, the scheduler should wait for the earlier of
recovery completion and configured wait time
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4679
>                 URL: https://issues.apache.org/jira/browse/YARN-4679
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Karthik Kambatla
>
> When work-preserving restart is enabled, it appears the restart (or failover) is unconditionally
blocked for the configured delay even if the recovery itself finishes sooner than this. This
should be updated to wait for the earlier of the two conditions. Also, it would be nice to
allow setting the config to -1 to mean waiting as long as needed for the recovery to complete.
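
The requested behavior could be sketched roughly as follows. This is an illustration under
assumptions, not the resourcemanager's actual code: the class and method names are
hypothetical, and it assumes some component can signal recovery completion through a
CountDownLatch, which the comment above notes the resourcemanager cannot currently do. Any
negative wait time is treated as "wait indefinitely", covering the proposed -1 setting.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class RecoveryWait {
    /**
     * Blocks until recovery is signalled or maxWaitMs elapses, whichever
     * comes first. A negative maxWaitMs (e.g. -1) means no time limit.
     *
     * @return true if recovery completed, false if the wait timed out
     */
    public static boolean awaitRecovery(CountDownLatch recoveryDone, long maxWaitMs)
            throws InterruptedException {
        if (maxWaitMs < 0) {
            // No time limit: block until recovery is signalled.
            recoveryDone.await();
            return true;
        }
        // Returns as soon as the latch reaches zero, or after the timeout.
        return recoveryDone.await(maxWaitMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        // Recovery already finished: returns immediately despite a 10 s limit.
        CountDownLatch finished = new CountDownLatch(1);
        finished.countDown();
        System.out.println(awaitRecovery(finished, 10_000));

        // Recovery never signalled: gives up after the configured wait.
        System.out.println(awaitRecovery(new CountDownLatch(1), 50));
    }
}
```

With this shape the scheduler would start allocating as soon as awaitRecovery returns, and the
return value tells it whether it is working from fully recovered state or only timed out.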




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
