hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4344) NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations
Date Wed, 11 Nov 2015 15:31:11 GMT

    [ https://issues.apache.org/jira/browse/YARN-4344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000521#comment-15000521 ]

Jason Lowe commented on YARN-4344:
----------------------------------

Thanks for the patch, Varun!  I think the change will fix the reported issue, but I'm a bit
skeptical of the vastly different handling of the event based on whether apps are running
or not.  For example, if the http port is changing when the node re-registers, why are we
treating it as a node removal then addition if there aren't any apps running but not if there
are apps running?  Seems like that should be consistent.

Comments on the patch itself:

The comment about sending the node removal event at the start of the main block in the transition
is no longer very accurate.
 
Please don't put large sleeps (on the order of seconds) in tests.  These extra sleep seconds
quickly add up to a significant amount of time over many tests.  If we need to sleep for polling
reasons the sleep should be much shorter, like on the order of 10ms.  Better than sleep-polling
is flushing the event dispatcher and then checking since we can avoid polling entirely.
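
To illustrate the short-interval polling suggested above, here is a minimal, self-contained sketch. The class and method names (ShortPollSketch, waitFor) are hypothetical and are not part of the patch or of any Hadoop API; the 10ms interval comes from the comment above, while the timeout values are assumptions chosen for the example.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not a Hadoop class: polls a condition at a short
// interval instead of sleeping for a fixed number of seconds.
class ShortPollSketch {

  /**
   * Re-checks the condition every intervalMs milliseconds until it holds.
   * Throws TimeoutException if it still does not hold after timeoutMs.
   */
  public static void waitFor(BooleanSupplier condition, long intervalMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!condition.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    AtomicBoolean done = new AtomicBoolean(false);

    // Stand-in for asynchronous event handling that finishes after about 50 ms.
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      done.set(true);
    });
    worker.start();

    // Poll every 10 ms; the test resumes shortly after the work completes
    // rather than always paying for a multi-second sleep.
    waitFor(done::get, 10, 5000);
    worker.join();
    System.out.println("condition met: " + done.get());
  }
}
```

The timeout only bounds the failure case. A passing test waits roughly as long as the work takes plus at most one polling interval. Flushing the dispatcher, where the test has access to it, removes even that wait.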

Nit: isCapabilityChanged init can be simplified to the following, similar to the noRunningApps
boolean init above it:
{code}
      boolean isCapabilityChanged =
          !rmNode.getTotalCapability().equals(newNode.getTotalCapability());
{code}

Nit: is this conditional check even necessary?  We can just update the total capability with
no semantic effect if it hasn't changed.  Since this is just updating a reference with another
precomputed one, it's not like we're avoiding some expensive code. ;-)
{code}
        if (isCapabilityChanged) {
          rmNode.totalCapability = newNode.getTotalCapability();
        }
{code}
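
The equivalence claimed above can be checked with a small stand-alone sketch. Everything in it is hypothetical: Resource and Node are simplified stand-ins written for this example, not the YARN classes, and the only assumption carried over from the patch is that capabilities are compared with equals().

```java
import java.util.Objects;

// Hypothetical stand-ins for illustration only; these are not the YARN classes.
class CapabilityUpdateSketch {

  /** Simplified immutable resource description with value equality. */
  static final class Resource {
    final int memoryMb;
    final int vcores;

    Resource(int memoryMb, int vcores) {
      this.memoryMb = memoryMb;
      this.vcores = vcores;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Resource)) {
        return false;
      }
      Resource other = (Resource) o;
      return memoryMb == other.memoryMb && vcores == other.vcores;
    }

    @Override
    public int hashCode() {
      return Objects.hash(memoryMb, vcores);
    }
  }

  /** Simplified node that just holds a reference to its total capability. */
  static final class Node {
    Resource totalCapability;

    Node(Resource totalCapability) {
      this.totalCapability = totalCapability;
    }
  }

  /** Guarded form: assign only when the reported capability differs. */
  static void updateIfChanged(Node node, Resource reported) {
    if (!node.totalCapability.equals(reported)) {
      node.totalCapability = reported;
    }
  }

  /** Unguarded form: always assign the reported capability. */
  static void updateAlways(Node node, Resource reported) {
    node.totalCapability = reported;
  }

  public static void main(String[] args) {
    Resource[] reports = { new Resource(8192, 8), new Resource(16384, 16) };
    for (Resource reported : reports) {
      Node guarded = new Node(new Resource(8192, 8));
      Node unguarded = new Node(new Resource(8192, 8));
      updateIfChanged(guarded, reported);
      updateAlways(unguarded, reported);
      // Both nodes end up with equal capabilities in the unchanged and changed cases.
      System.out.println(guarded.totalCapability.equals(unguarded.totalCapability));
    }
  }
}
```

The two forms differ only in which object reference the node holds when nothing changed. The value seen by callers that compare with equals() is the same either way.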

> NMs reconnecting with changed capabilities can lead to wrong cluster resource calculations
> ------------------------------------------------------------------------------------------
>
>                 Key: YARN-4344
>                 URL: https://issues.apache.org/jira/browse/YARN-4344
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.1, 2.6.2
>            Reporter: Varun Vasudev
>            Assignee: Varun Vasudev
>            Priority: Critical
>         Attachments: YARN-4344.001.patch
>
>
> After YARN-3802, if an NM reconnects to the RM with changed capabilities, situations can
> arise where the overall cluster resource calculation is incorrect, leading to
> inconsistencies in scheduling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
