Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Fri, 11 Oct 2013 14:08:48 +0000 (UTC)
From: "Hudson (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12672035.1380750532555.51669.1381500528261@arcas>
In-Reply-To: <JIRA.12672035.1380750532555@arcas>
References: <JIRA.12672035.1380750532555@arcas>
Subject: [jira] [Commented] (YARN-1265) Fair Scheduler chokes on unhealthy
 node reconnect
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13792637#comment-13792637 ] 

Hudson commented on YARN-1265:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1575 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1575/])
YARN-1265. Fair Scheduler chokes on unhealthy node reconnect (Sandy Ryza) (sandy: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1531146)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java


> Fair Scheduler chokes on unhealthy node reconnect
> -------------------------------------------------
>
>                 Key: YARN-1265
>                 URL: https://issues.apache.org/jira/browse/YARN-1265
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager, scheduler
>    Affects Versions: 2.1.1-beta
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>             Fix For: 2.2.1
>
>         Attachments: YARN-1265-1.patch, YARN-1265.patch
>
>
> Only nodes in the RUNNING state are tracked by schedulers.  When a node reconnects, RMNodeImpl.ReconnectNodeTransition tries to remove it from the scheduler, even if it's not in the RUNNING state (for example, an unhealthy node).  The FairScheduler doesn't guard against being asked to remove a node it isn't tracking.
> I think the best way to fix this is to check whether a node is RUNNING before telling the scheduler to remove it.
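
To make the check proposed in the description concrete, here is a minimal, self-contained sketch. It is not the committed patch (which, per the file list above, touched FairScheduler.java) and it does not use the real YARN classes: ReconnectGuardSketch, ToyScheduler, NodeState, and onReconnect are hypothetical names invented for illustration, and the NullPointerException failure mode is an assumption about how an unguarded removal might fail.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model. None of these types are the real YARN classes.
class ReconnectGuardSketch {

    // Reduced set of node states, enough to illustrate the check.
    enum NodeState { NEW, RUNNING, UNHEALTHY }

    // Stand-in for a scheduler that tracks only RUNNING nodes.
    static class ToyScheduler {
        private final Map<String, Integer> trackedNodes = new HashMap<>();

        void addNode(String nodeId, int memoryMb) {
            trackedNodes.put(nodeId, memoryMb);
        }

        boolean isTracked(String nodeId) {
            return trackedNodes.containsKey(nodeId);
        }

        // Unguarded removal: assumes the node is tracked.
        // Throws NullPointerException when it is not (assumed failure mode).
        int removeNodeUnguarded(String nodeId) {
            Integer memoryMb = trackedNodes.remove(nodeId);
            return memoryMb.intValue();
        }
    }

    // Reconnect handling with the proposed check: only a RUNNING node
    // is removed from the scheduler. Returns true if a removal was issued.
    static boolean onReconnect(ToyScheduler scheduler, String nodeId, NodeState state) {
        if (state != NodeState.RUNNING) {
            return false; // scheduler never tracked this node, nothing to remove
        }
        scheduler.removeNodeUnguarded(nodeId);
        return true;
    }

    public static void main(String[] args) {
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.addNode("node-1", 4096);

        // A RUNNING node is tracked, so removal on reconnect is safe.
        System.out.println(onReconnect(scheduler, "node-1", NodeState.RUNNING));

        // An UNHEALTHY node was never tracked, so the guard skips removal.
        System.out.println(onReconnect(scheduler, "node-2", NodeState.UNHEALTHY));
    }
}
```

An alternative design is to make the scheduler's own removal path tolerate unknown nodes; the sketch shows only the caller-side check suggested in the description.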

