Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Tue, 9 Feb 2016 04:34:18 +0000 (UTC)
From: "Karthik Kambatla (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12937579.1454952428000.1646.1454992458177@Atlassian.JIRA>
In-Reply-To: <JIRA.12937579.1454952428000@Atlassian.JIRA>
References: <JIRA.12937579.1454952428000@Atlassian.JIRA>
 <JIRA.12937579.1454952428619@arcas>
Subject: [jira] [Commented] (YARN-4679) When work-preserving restart is
 enabled, the scheduler should wait for the earlier of recovery completion
 and configured wait time
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138333#comment-15138333 ] 

Karthik Kambatla commented on YARN-4679:
----------------------------------------

Thanks Jason. My bad - completely forgot the discussion around this. 

[~jianhe], [~vinodkv] - I vaguely remember us discussing a threshold on the fraction of previously connected nodes, in addition to this timeout. Do I remember right? Do you think it still makes sense, and can we use it as a proxy for recovery completion?
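
To make the node-fraction idea concrete, here is a rough sketch of how such a threshold could stand in for recovery completion. Everything in it is hypothetical: the class name NodeRejoinTracker, its methods, and the assumption that the RM has some record of which nodes were connected before the restart are illustrations only, not existing YARN code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of YARN: tracks how many of the nodes
// known before the restart have registered again.
final class NodeRejoinTracker {
    private final Set<String> previouslyConnected;
    private final Set<String> rejoined = new HashSet<>();
    private final double threshold; // assumed to be in the range 0.0 to 1.0

    NodeRejoinTracker(Set<String> previouslyConnected, double threshold) {
        this.previouslyConnected = new HashSet<>(previouslyConnected);
        this.threshold = threshold;
    }

    // Record a node registration; nodes unknown before the restart are ignored.
    synchronized void nodeRegistered(String nodeId) {
        if (previouslyConnected.contains(nodeId)) {
            rejoined.add(nodeId);
        }
    }

    // True once the rejoined fraction reaches the threshold.
    // With no previously connected nodes there is nothing to wait for.
    synchronized boolean thresholdReached() {
        if (previouslyConnected.isEmpty()) {
            return true;
        }
        return (double) rejoined.size() / previouslyConnected.size() >= threshold;
    }
}
```

The scheduler could then treat thresholdReached() as "recovery is done enough" and stop waiting, with the timeout still acting as the upper bound.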

> When work-preserving restart is enabled, the scheduler should wait for the earlier of recovery completion and configured wait time
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4679
>                 URL: https://issues.apache.org/jira/browse/YARN-4679
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Karthik Kambatla
>
> When work-preserving restart is enabled, the restart (or failover) appears to be blocked unconditionally for the configured delay, even if recovery itself finishes sooner. This should be updated to wait for the earlier of the two conditions. It would also be nice to allow setting the config to -1, meaning wait as long as needed for recovery to complete.
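
The behavior requested above could be sketched roughly as follows. This is only an illustration under stated assumptions: RecoveryWait, markRecoveryComplete, and awaitRecovery are made-up names, and the real ResourceManager code path is not shown in this thread. The sketch relies only on java.util.concurrent.CountDownLatch from the JDK.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of YARN: waits for the earlier of
// recovery completion and a configured wait time.
final class RecoveryWait {
    private final CountDownLatch recoveryDone = new CountDownLatch(1);

    // Called by whatever component detects that recovery has finished.
    void markRecoveryComplete() {
        recoveryDone.countDown();
    }

    // Blocks until recovery completes or waitMs elapses, whichever is first.
    // A negative waitMs (for example -1) means wait indefinitely.
    // Returns true if recovery completed, false on timeout or interruption.
    boolean awaitRecovery(long waitMs) {
        try {
            if (waitMs < 0) {
                recoveryDone.await();
                return true;
            }
            return recoveryDone.await(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, the scheduler would call awaitRecovery with the configured wait time and start allocating as soon as it returns, regardless of which condition ended the wait.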

