Date: Thu, 26 Feb 2015 09:38:04 +0000 (UTC)
From: "Chengbing Liu (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3266) RMContext inactiveNodes should have NodeId as map key

     [ https://issues.apache.org/jira/browse/YARN-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chengbing Liu updated YARN-3266:
--------------------------------
    Attachment: YARN-3266.01.patch

> RMContext inactiveNodes should have NodeId as map key
> -----------------------------------------------------
>
>                 Key: YARN-3266
>                 URL: https://issues.apache.org/jira/browse/YARN-3266
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Chengbing Liu
>            Assignee: Rohith
>         Attachments: YARN-3266.01.patch
>
>
> Under the default NM port configuration, which is 0, we have observed in the current version that the "lost nodes" count is greater than the length of the lost node list. This happens when we restart the same NM twice in a row:
> * NM started at port 10001
> * NM restarted at port 10002
> * NM restarted at port 10003
> * NM:10001 times out: {{ClusterMetrics#incrNumLostNMs()}}, # lost node=1; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} has 1 element
> * NM:10002 times out: {{ClusterMetrics#incrNumLostNMs()}}, # lost node=2; {{rmNode.context.getInactiveRMNodes().put(rmNode.nodeId.getHost(), rmNode)}}, {{inactiveNodes}} still has 1 element
>
> Since we allow multiple NodeManagers on one host (as discussed in YARN-1888), {{inactiveNodes}} should be of type {{ConcurrentMap<NodeId, RMNode>}}. If this would break the current API, then the key string should include the NM's port as well.
>
> Thoughts?
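
The mismatch described in the report can be reproduced outside YARN with a minimal Java sketch. It is not the ResourceManager code: {{SimpleNodeId}}, {{loseNodesKeyedByHost}} and {{loseNodesKeyedByNodeId}} are hypothetical names introduced here, the map values are plain strings standing in for RMNode, and the counter only mirrors what {{ClusterMetrics#incrNumLostNMs()}} is described as doing. Under those assumptions, keying the map by host alone overwrites the earlier entry, while keying by host and port keeps both:

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class InactiveNodesSketch {

    /** Hypothetical stand-in for YARN's NodeId: a host plus a port. */
    static final class SimpleNodeId {
        final String host;
        final int port;

        SimpleNodeId(String host, int port) {
            this.host = host;
            this.port = port;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SimpleNodeId)) return false;
            SimpleNodeId other = (SimpleNodeId) o;
            return port == other.port && host.equals(other.host);
        }

        @Override
        public int hashCode() {
            return Objects.hash(host, port);
        }
    }

    /** Returns {lost-node counter, inactive map size} when the map is keyed by host only. */
    public static int[] loseNodesKeyedByHost(String host, int... ports) {
        ConcurrentMap<String, String> inactive = new ConcurrentHashMap<>();
        int lostCount = 0;
        for (int port : ports) {
            lostCount++;                            // counter is bumped for every lost NM
            inactive.put(host, host + ":" + port);  // same key each time: entry is overwritten
        }
        return new int[] {lostCount, inactive.size()};
    }

    /** Returns {lost-node counter, inactive map size} when the map is keyed by host and port. */
    public static int[] loseNodesKeyedByNodeId(String host, int... ports) {
        ConcurrentMap<SimpleNodeId, String> inactive = new ConcurrentHashMap<>();
        int lostCount = 0;
        for (int port : ports) {
            lostCount++;
            inactive.put(new SimpleNodeId(host, port), host + ":" + port); // distinct key per port
        }
        return new int[] {lostCount, inactive.size()};
    }

    public static void main(String[] args) {
        int[] byHost = loseNodesKeyedByHost("host1", 10001, 10002);
        int[] byId = loseNodesKeyedByNodeId("host1", 10001, 10002);
        System.out.println("keyed by host:   lost=" + byHost[0] + ", listed=" + byHost[1]);
        System.out.println("keyed by NodeId: lost=" + byId[0] + ", listed=" + byId[1]);
    }
}
```

With ports 10001 and 10002 timing out on the same host, the host-keyed variant reports 2 lost nodes but lists only 1, which matches the symptom above; the variant keyed by host and port lists 2.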