Date: Mon, 27 Jul 2015 17:46:05 +0000 (UTC)
From: "MENG DING (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-1644) RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing

    [ https://issues.apache.org/jira/browse/YARN-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643099#comment-14643099 ]

MENG DING commented on YARN-1644:
---------------------------------

bq. NM re-registration can still happen between the time the increase action is accepted, and the time it's added into increasedContainers. Even startContainer has the same problem, newly started container may fall into this tiny window that RM won't recover this container.

Yes, you are right that startContainer would have the same problem.
So to make it clear, RM restart/NM re-registration can happen in the following scenarios:

* 1. Container resource increase is already completed. In this case, NM re-registration can send the correct (increased) container size (through the containerStatus object) for RM recovery.
* 2. The container to be increased has been added into increasedContainers, but the resource is not yet updated. In this case, NM re-registration can send the correct container size through both the containerStatus and increasedContainers objects for RM recovery.
* 3. The increase action is accepted, but the container to be increased has not yet been added into increasedContainers. In this case, the resource views of NM and RM diverge. The same issue applies to startContainers.

I don't have a solution for scenario 3 yet, but I think the chance of it happening is very small, especially with the {{blockNewContainerRequests}} and matching RM identifier logic in place right now. Maybe we can file a separate JIRA for scenario 3, and fix it for both container increase and container launch?

> RM-NM protocol changes and NodeStatusUpdater implementation to support container resizing
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-1644
>                 URL: https://issues.apache.org/jira/browse/YARN-1644
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>            Reporter: Wangda Tan
>            Assignee: MENG DING
>         Attachments: YARN-1644-YARN-1197.4.patch, YARN-1644-YARN-1197.5.patch, YARN-1644.1.patch, YARN-1644.2.patch, YARN-1644.3.patch, yarn-1644.1.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
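
To make the three scenarios in the comment concrete, here is a minimal toy model in Java. It is only a sketch: the class, its fields, and its methods are hypothetical and do not mirror the real NodeManager code. It also assumes that, on re-registration, an entry in increasedContainers overrides the size carried by the container status, and that applying an increase drops the entry again.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: all names here are hypothetical, not the real NM classes.
class ResizeRecoverySketch {
    // Container sizes (MB) as reported through container statuses.
    private final Map<String, Integer> containerStatus = new HashMap<>();
    // Increases that were recorded but are not yet applied to the container.
    private final Map<String, Integer> increasedContainers = new HashMap<>();
    // Increases that were accepted but not yet recorded (the scenario 3 window).
    private final Map<String, Integer> acceptedOnly = new HashMap<>();

    void launch(String id, int mb) { containerStatus.put(id, mb); }

    // Step 1: the increase action is accepted.
    void acceptIncrease(String id, int mb) { acceptedOnly.put(id, mb); }

    // Step 2: the container is added into increasedContainers.
    void recordIncrease(String id) {
        Integer mb = acceptedOnly.remove(id);
        if (mb != null) increasedContainers.put(id, mb);
    }

    // Step 3: the resource is actually updated.
    void applyIncrease(String id) {
        Integer mb = increasedContainers.remove(id);
        if (mb != null) containerStatus.put(id, mb);
    }

    // Size the RM would recover if the NM re-registered right now.
    // Assumption: an increasedContainers entry overrides the status size.
    int recoveredSize(String id) {
        Integer inc = increasedContainers.get(id);
        return inc != null ? inc : containerStatus.get(id);
    }

    // Size the NM itself has committed to, including accepted-only increases.
    int nmSize(String id) {
        Integer pending = acceptedOnly.get(id);
        return pending != null ? pending : recoveredSize(id);
    }

    // Builds an NM that has run the first `steps` steps of a 1024 -> 2048 increase.
    static ResizeRecoverySketch afterSteps(int steps) {
        ResizeRecoverySketch nm = new ResizeRecoverySketch();
        nm.launch("c1", 1024);
        if (steps >= 1) nm.acceptIncrease("c1", 2048);
        if (steps >= 2) nm.recordIncrease("c1");
        if (steps >= 3) nm.applyIncrease("c1");
        return nm;
    }

    public static void main(String[] args) {
        // Scenario 1 = all three steps done, scenario 2 = two, scenario 3 = one.
        for (int scenario = 1; scenario <= 3; scenario++) {
            ResizeRecoverySketch nm = afterSteps(4 - scenario);
            System.out.println("scenario " + scenario
                + ": NM view=" + nm.nmSize("c1")
                + " MB, RM recovered view=" + nm.recoveredSize("c1") + " MB");
        }
    }
}
```

In this model the two views agree in scenarios 1 and 2 and differ only in scenario 3, where re-registration falls between acceptance and the update of increasedContainers.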