Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Mon, 27 Jul 2015 17:46:05 +0000 (UTC)
From: "MENG DING (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12691251.1390724105000.299196.1438019165308@Atlassian.JIRA>
In-Reply-To: <JIRA.12691251.1390724105000@Atlassian.JIRA>
References: <JIRA.12691251.1390724105000@Atlassian.JIRA>
 <JIRA.12691251.1390724105382@arcas>
Subject: [jira] [Commented] (YARN-1644) RM-NM protocol changes and
 NodeStatusUpdater implementation to support container resizing
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643099#comment-14643099 ] 

MENG DING commented on YARN-1644:
---------------------------------

bq.  NM re-registration can still happen between the time the increase action is accepted, and the time it's added into increasedContainers. Even startContainer has the same problem, newly started container may fall into this tiny window that RM won't recover this container.
Yes, you are right that startContainer would have the same problem. 
To be clear, an RM restart/NM re-registration can happen in any of the following scenarios:
* 1. Container resource increase is already completed. In this case, NM re-registration can send the correct (increased) container size (through containerStatus object) for RM recovery.
* 2. Container to be increased has been added into increasedContainers, but the resource is not yet updated. In this case, NM re-registration can send the correct container size through both containerStatus and increasedContainers objects for RM recovery.
* 3. The increase action is accepted, but the container to be increased has not yet been added into increasedContainers. In this case, the NM and the RM end up with different views of the container's resource. The same issue applies to startContainers.

I don't have a solution for scenario 3 yet, but I think the chance of it happening is very small, especially with the {{blockNewContainerRequests}} and matching RM identifier logic in place right now. Maybe we can file a separate JIRA for scenario 3, and fix it for both container increase and container launch?
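
To make the three scenarios above concrete, here is a rough toy model in Java. It is only a sketch and not the actual NodeManager code: {{ResizeRecoverySketch}}, {{NodeState}}, {{scenario}}, {{recoveredMemoryMb}} and the container id "c1" are all hypothetical names, and the rule that a pending entry in increasedContainers takes precedence over containerStatus during recovery is my assumption for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ResizeRecoverySketch {

    /** Toy stand-in for NM-side state sent on re-registration; not a real YARN class. */
    static final class NodeState {
        // container id -> memory (MB) as reported through the container status
        final Map<String, Integer> containerStatus = new HashMap<>();
        // container id -> target memory (MB) for increases recorded but not yet applied
        final Map<String, Integer> increasedContainers = new HashMap<>();
    }

    /** Builds the state for one container "c1"; pendingMb == null means no pending entry. */
    static NodeState scenario(int statusMb, Integer pendingMb) {
        NodeState nm = new NodeState();
        nm.containerStatus.put("c1", statusMb);
        if (pendingMb != null) {
            nm.increasedContainers.put("c1", pendingMb);
        }
        return nm;
    }

    /**
     * Size the RM would recover from the re-registration report.
     * Assumption: a pending increase, if reported, wins over the status size.
     * The container id must be present in containerStatus.
     */
    static int recoveredMemoryMb(NodeState nm, String id) {
        Integer pending = nm.increasedContainers.get(id);
        return pending != null ? pending : nm.containerStatus.get(id);
    }

    public static void main(String[] args) {
        int target = 2048; // size the container ends up with once the increase is applied

        // Scenario 1: increase already completed, status carries the new size.
        NodeState s1 = scenario(2048, null);
        // Scenario 2: recorded in increasedContainers, resource not yet updated.
        NodeState s2 = scenario(1024, 2048);
        // Scenario 3: increase accepted but not yet recorded anywhere in the report.
        NodeState s3 = scenario(1024, null);

        System.out.println("scenario 1 consistent: " + (recoveredMemoryMb(s1, "c1") == target));
        System.out.println("scenario 2 consistent: " + (recoveredMemoryMb(s2, "c1") == target));
        System.out.println("scenario 3 consistent: " + (recoveredMemoryMb(s3, "c1") == target));
    }
}
```

In this model only scenario 3 leaves the recovered size different from the size the container eventually gets, which is the window described above.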

> RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-1644
>                 URL: https://issues.apache.org/jira/browse/YARN-1644
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Wangda Tan
>            Assignee: MENG DING
>         Attachments: YARN-1644-YARN-1197.4.patch, YARN-1644-YARN-1197.5.patch, YARN-1644.1.patch, YARN-1644.2.patch, YARN-1644.3.patch, yarn-1644.1.patch
>
>

