Date: Tue, 9 Feb 2016 04:34:18 +0000 (UTC)
From: "Karthik Kambatla (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4679) When work-preserving restart is enabled, the scheduler should wait for the earlier of recovery completion and configured wait time

[ https://issues.apache.org/jira/browse/YARN-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138333#comment-15138333 ]

Karthik Kambatla commented on YARN-4679:
----------------------------------------

Thanks Jason. My bad - completely forgot the discussion around this.
[~jianhe], [~vinodkv] - I vaguely remember us discussing the notion of a threshold for the fraction of previously connected nodes, in addition to this timeout. Do I remember right? Do you think it still makes sense, and can we use it as a proxy for recovery completion?

> When work-preserving restart is enabled, the scheduler should wait for the earlier of recovery completion and configured wait time
> ----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4679
>                 URL: https://issues.apache.org/jira/browse/YARN-4679
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Karthik Kambatla
>
> When work-preserving restart is enabled, it appears the restart (or failover) is unconditionally blocked for the configured delay even if the recovery itself finishes sooner than this. This should be updated to wait for the earlier of the two conditions. Also, it would be nice to allow setting the config to -1 to indicate waiting as long as needed for the recovery to be completed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
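
For illustration only, here is a rough sketch of the wait rule the issue asks for: start scheduling at the earlier of recovery completion and the configured wait time, with a negative wait meaning "wait until recovery completes". It also sketches the node-fraction threshold raised in the comment as a possible proxy for recovery completion. The class, method, and parameter names are all hypothetical and are not taken from the ResourceManager code. How recovery completion would actually be detected is left open here, as it is in the thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; none of these names come from the YARN code base. */
final class RecoveryWaitSketch {

    /**
     * Earlier-of rule: scheduling may start once recovery has completed or
     * the configured wait has elapsed. A negative waitMs (for example -1)
     * means the timeout never fires, so only recovery completion unblocks.
     */
    public static boolean mayStartScheduling(boolean recoveryComplete,
                                             long elapsedMs, long waitMs) {
        if (recoveryComplete) {
            return true;
        }
        if (waitMs < 0) {
            return false;
        }
        return elapsedMs >= waitMs;
    }

    /**
     * Assumed proxy for recovery completion: the fraction of previously
     * connected nodes that are back has reached the threshold.
     */
    public static boolean enoughNodesBack(int reconnected,
                                          int previouslyConnected,
                                          double threshold) {
        if (previouslyConnected <= 0) {
            return true; // no nodes were known before, nothing to wait for
        }
        return (double) reconnected / previouslyConnected >= threshold;
    }

    /**
     * Blocking variant of the same rule. Returns true if recovery finished
     * before the wait expired, false if the timeout fired first.
     */
    public static boolean awaitRecovery(CountDownLatch recoveryDone, long waitMs)
            throws InterruptedException {
        if (waitMs < 0) {
            recoveryDone.await(); // wait as long as recovery needs
            return true;
        }
        return recoveryDone.await(waitMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch recoveryDone = new CountDownLatch(1);
        recoveryDone.countDown(); // pretend recovery has already finished
        System.out.println(awaitRecovery(recoveryDone, 10_000L));
        System.out.println(mayStartScheduling(false, 5_000L, 10_000L));
        System.out.println(enoughNodesBack(8, 10, 0.9));
    }
}
```

In this sketch the result of enoughNodesBack would be passed in as recoveryComplete, so the timeout still bounds the wait unless it is set to a negative value.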