Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Thu, 12 Nov 2015 23:22:11 +0000 (UTC)
From: "zhihai xu (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12912126.1447238905000.56406.1447370531036@Atlassian.JIRA>
In-Reply-To: <JIRA.12912126.1447238905000@Atlassian.JIRA>
References: <JIRA.12912126.1447238905000@Atlassian.JIRA>
 <JIRA.12912126.1447238905099@arcas>
Subject: [jira] [Commented] (YARN-4344) NMs reconnecting with changed
 capabilities can lead to wrong cluster resource calculations
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15003181#comment-15003181 ] 

zhihai xu commented on YARN-4344:
---------------------------------

+1 for Jason Lowe's suggestion to fix the issue on the scheduler side. Using {{SchedulerNode.getTotalResource()}} instead of {{RMNode.getTotalCapability()}} inside the scheduler would better decouple the scheduler from the RMNodeImpl state machine. It may also fix other potential issues. For example, {{CapacityScheduler#addNode}} calls {{nodeManager.getTotalCapability()}} after creating the {{FiCaSchedulerNode}}; if the RMNodeImpl state machine changes {{nodeManager.totalCapability}} right after the {{FiCaSchedulerNode}} is created, a similar issue may occur.
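
To illustrate the kind of mismatch described above, here is a minimal, self-contained sketch. It is not Hadoop code: {{LiveNode}}, {{SchedulerSideNode}}, {{ClusterTracker}} and {{runScenario}} are hypothetical stand-ins, memory is a plain int, and the concurrent state-machine update is simulated by a callback that runs between the two steps of {{addNode}}. It also assumes that node removal subtracts the scheduler-side value.

```java
import java.util.HashMap;
import java.util.Map;

class NodeResourceSketch {

    /** Hypothetical stand-in for a live node record whose capability another component may change. */
    static final class LiveNode {
        private int totalMemoryMb;
        LiveNode(int totalMemoryMb) { this.totalMemoryMb = totalMemoryMb; }
        int getTotalCapability() { return totalMemoryMb; }
        void setTotalCapability(int totalMemoryMb) { this.totalMemoryMb = totalMemoryMb; }
    }

    /** Hypothetical scheduler-side view: copies the capability once, at creation. */
    static final class SchedulerSideNode {
        private final int totalMemoryMb;
        SchedulerSideNode(LiveNode live) { this.totalMemoryMb = live.getTotalCapability(); }
        int getTotalResource() { return totalMemoryMb; }
    }

    /** Hypothetical cluster-wide accounting, parameterized by which value it reads on add. */
    static final class ClusterTracker {
        private final boolean useSnapshot;
        private final Map<String, SchedulerSideNode> nodes = new HashMap<>();
        private int clusterMemoryMb = 0;

        ClusterTracker(boolean useSnapshot) { this.useSnapshot = useSnapshot; }

        void addNode(String id, LiveNode live, Runnable concurrentUpdate) {
            SchedulerSideNode view = new SchedulerSideNode(live);
            nodes.put(id, view);
            // Simulates the node's capability changing right after the view is created.
            concurrentUpdate.run();
            // Reading the live value here can disagree with what the view recorded.
            clusterMemoryMb += useSnapshot ? view.getTotalResource() : live.getTotalCapability();
        }

        void removeNode(String id) {
            SchedulerSideNode view = nodes.remove(id);
            // Assumption of this sketch: removal subtracts the scheduler-side value.
            if (view != null) {
                clusterMemoryMb -= view.getTotalResource();
            }
        }

        int getClusterMemoryMb() { return clusterMemoryMb; }
    }

    /** Adds one node whose capability drops from 8192 to 4096 mid-add, then removes it. */
    static int runScenario(boolean useSnapshot) {
        ClusterTracker tracker = new ClusterTracker(useSnapshot);
        LiveNode live = new LiveNode(8192);
        tracker.addNode("nm-1", live, () -> live.setTotalCapability(4096));
        tracker.removeNode("nm-1");
        return tracker.getClusterMemoryMb();
    }

    public static void main(String[] args) {
        System.out.println("live read: " + runScenario(false)); // prints "live read: -4096"
        System.out.println("snapshot:  " + runScenario(true));  // prints "snapshot:  0"
    }
}
```

In this sketch, reading the live value on add and the recorded value on remove leaves the cluster total at -4096 after the only node is gone, while reading the scheduler-side value in both places returns it to 0.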

> NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-4344
>                 URL: https://issues.apache.org/jira/browse/YARN-4344
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1, 2.6.2
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>            Priority: Critical
>         Attachments: YARN-4344.001.patch
>
>
> After YARN-3802, if an NM reconnects to the RM with changed capabilities, the overall cluster resource calculation can become incorrect, leading to inconsistencies in scheduling.

