Date: Wed, 13 Jan 2016 16:35:39 +0000 (UTC)
From: "Junping Du (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-4576) Pluggable blacklist/whitelist policies in launching AM to protect AM failed multiple times on problematic nodes

     [ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated YARN-4576:
-----------------------------
    Summary: Pluggable blacklist/whitelist policies in launching AM to protect AM failed multiple times on problematic nodes  (was: Extend blacklist mechanism to protect AM failed multiple times on failure nodes)

> Pluggable blacklist/whitelist policies in launching AM to protect AM failed multiple times on problematic nodes
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> The current YARN blacklist mechanism tracks bad nodes on the AM side: if the AM's attempts to launch containers on a specific node fail several times, the AM blacklists that node in its future resource requests. This mechanism works fine for normal containers. However, we have observed the following on several clusters: if launching the AM itself fails on a problematic node, the RM can pick that same node for the next AM attempts again and again, which causes the application to fail when the other, functional nodes are busy. Normally, the customized health checker script cannot be so sensitive that it marks a node as unhealthy after only one or two failed container launches. On the RM side, however, we can blacklist such nodes for AM launches for a certain period of time if AM launches have failed on them before.
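
The RM-side idea described in the issue (stop placing AMs on a node for a certain time after AM launches have failed there) could be sketched roughly as follows. This is only an illustration, not the YARN implementation or API: the class `AmLaunchBlacklist`, its methods, the failure threshold, and the blacklist duration are all hypothetical names and parameters chosen for the example. The issue itself only mentions "a certain time" and does not specify a threshold.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical RM-side tracker of nodes on which AM launches have failed.
 * Not part of the YARN API; names and parameters are assumptions.
 */
class AmLaunchBlacklist {
    private final int failureThreshold;     // assumed: failures before blacklisting
    private final long blacklistDurationMs; // assumed: how long a node stays blacklisted
    private final Map<String, Integer> failureCounts = new HashMap<>();
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    AmLaunchBlacklist(int failureThreshold, long blacklistDurationMs) {
        this.failureThreshold = failureThreshold;
        this.blacklistDurationMs = blacklistDurationMs;
    }

    /** Record a failed AM launch on the node; blacklist it once the threshold is reached. */
    void recordAmLaunchFailure(String node, long nowMs) {
        int failures = failureCounts.merge(node, 1, Integer::sum);
        if (failures >= failureThreshold) {
            blacklistedUntil.put(node, nowMs + blacklistDurationMs);
            failureCounts.remove(node); // start counting afresh after blacklisting
        }
    }

    /** True while the node's blacklist period has not yet expired. */
    boolean isBlacklisted(String node, long nowMs) {
        Long until = blacklistedUntil.get(node);
        if (until == null) {
            return false;
        }
        if (nowMs >= until) {
            blacklistedUntil.remove(node); // period over: node may host AMs again
            return false;
        }
        return true;
    }

    /** Filter the candidate nodes down to those currently allowed to host an AM. */
    Set<String> eligibleNodes(Set<String> candidates, long nowMs) {
        Set<String> eligible = new HashSet<>();
        for (String node : candidates) {
            if (!isBlacklisted(node, nowMs)) {
                eligible.add(node);
            }
        }
        return eligible;
    }
}
```

In this sketch the caller passes the current time explicitly, which keeps the expiry logic easy to test. A pluggable policy, as the issue title suggests, could sit behind an interface with the same three operations, but that design is likewise an assumption here.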