Date: Tue, 10 May 2016 05:54:12 +0000 (UTC)
From: "Rohith Sharma K S (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5063) Fail to launch AM continuously on a lost NM

[ https://issues.apache.org/jira/browse/YARN-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277664#comment-15277664 ]

Rohith Sharma K S commented on YARN-5063:
-----------------------------------------

Thanks for clarifying my doubts.
I think this is a very good scenario for AM node blacklisting to consider. Cc'ing [~sunilg] [~vvasudev] [~vinodkv] so that they are also aware of this scenario. As I said, there are design-level issues in YARN-2005, so we need to wait for a proper solution. Example: a node is not reachable, so that node is blacklisted. Since the other nodes are busy, the same node comes back (registers again), yet the newly registered node is not considered for allocation. See YARN-4685.

> Fail to launch AM continuously on a lost NM
> -------------------------------------------
>
>                 Key: YARN-5063
>                 URL: https://issues.apache.org/jira/browse/YARN-5063
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>
> If an NM node shuts down, the RM will not mark it as LOST until the liveness monitor finds that it has timed out. Before that happens, however, the RM might continuously allocate the AM on that NM.
> We found this case in our cluster: the RM continuously allocated the same AM on a lost NM before the RM found it lost, and AMLauncher always failed because it could not connect to the lost NM. To solve the problem, we could add the NM to the AM blacklist if the RM fails to launch the AM on it.
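
To make the proposal in the quoted description concrete (add the NM to the AM blacklist when the RM fails to launch the AM on it), here is a minimal sketch. It does not use any real YARN class or API: AmLaunchBlacklist, its method names, and the clusterSize / disableThreshold parameters are all hypothetical. Clearing the entry when the node registers again, and switching blacklisting off once too many nodes are listed, are assumptions added to illustrate the concern raised in the comment about a node that comes back while other nodes are busy.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical per-application tracker of nodes on which an AM launch failed.
 * This is an illustration only and is not part of YARN.
 */
final class AmLaunchBlacklist {
    private final Set<String> failedNodes = new HashSet<>();
    private final int clusterSize;          // assumed: number of NMs in the cluster
    private final double disableThreshold;  // assumed: fraction of nodes that disables blacklisting

    AmLaunchBlacklist(int clusterSize, double disableThreshold) {
        this.clusterSize = clusterSize;
        this.disableThreshold = disableThreshold;
    }

    /** Called when the launcher could not reach the NM to start the AM. */
    void onLaunchFailure(String nodeId) {
        failedNodes.add(nodeId);
    }

    /** Called when a node registers again; the earlier failure no longer applies. */
    void onNodeRegistered(String nodeId) {
        failedNodes.remove(nodeId);
    }

    /** True if the scheduler may place the next AM attempt on this node. */
    boolean isUsable(String nodeId) {
        if (isDisabled()) {
            return true; // too many nodes listed: ignore the blacklist to avoid a hang
        }
        return !failedNodes.contains(nodeId);
    }

    private boolean isDisabled() {
        return failedNodes.size() >= clusterSize * disableThreshold;
    }
}
```

With a tracker like this, the next AM attempt would skip the unreachable NM right after the first failed launch instead of waiting for the liveness monitor to expire it, and a node that registers again becomes eligible immediately.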