Date: Wed, 21 Oct 2015 05:13:27 +0000 (UTC)
From: "Sangjin Lee (JIRA)"
To: yarn-issues@hadoop.apache.org
Reply-To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4284) condition for AM blacklisting is too narrow

    [ https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966248#comment-14966248 ]

Sangjin Lee commented on YARN-4284:
-----------------------------------

Hi [~sunilg], thanks for the comment. Yes, I've been following the discussion on YARN-2005 as well as YARN-2293.
Although it would be nice to have a reliable scoring mechanism as a basis for assigning AM containers, what's implemented in YARN-2005 is actually a pretty solid solution to this problem. By the way, this is one of the more common issues our users encounter.

The only problem with YARN-2005 is that the blacklisting condition is too narrow. In fact, we rarely encounter the DISKS_FAILED error. It's usually more like INVALID (-1000) or other errors. We can try to be really precise and blacklist nodes only if the container exit status is purely due to the node itself and is not caused by the app. But maintaining that precise condition may prove to be brittle.

IMO the key is that the blacklisting implemented in YARN-2005 is *per-app*. As such, we can afford to be more aggressive, instead of trying to come up with a 100% accurate blacklisting condition. Since it is per-app, there is no risk that one bad app can cause a node to be blacklisted for all other apps (correct me if I'm wrong).

Thoughts? Do you see any other risk in taking this approach?

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the next app attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is limited to {{DISKS_FAILED}}. There are a whole host of other issues that may cause the failure, for which we want to locate the AM elsewhere; e.g. disks full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in blacklisting the nodes on *any failure* (although it might lead to blacklisting the node more aggressively than necessary).
> I would propose placing the next app attempt on a different node after any failure.
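To make the difference between the two conditions concrete, here is a minimal, self-contained sketch. It is not the ResourceManager code from YARN-2005 or the attached patch. The class `AmBlacklistSketch`, its methods, and the string app and node ids are all hypothetical. The value -1000 for INVALID comes from the comment above; the value -101 for DISKS_FAILED is assumed to match YARN's `ContainerExitStatus`.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.IntPredicate;

/** Illustrative only: a toy per-app AM blacklist with a pluggable condition. */
final class AmBlacklistSketch {
    // Exit-status values (see the assumptions stated in the text above).
    static final int SUCCESS = 0;
    static final int DISKS_FAILED = -101;
    static final int INVALID = -1000;

    /** Narrow condition: only a disk failure blacklists the node. */
    static boolean narrowCondition(int exitStatus) {
        return exitStatus == DISKS_FAILED;
    }

    /** Broad condition proposed in the issue: any non-successful exit blacklists the node. */
    static boolean anyFailureCondition(int exitStatus) {
        return exitStatus != SUCCESS;
    }

    private final IntPredicate condition;
    // One blacklist per application, so one app's failures never affect another app.
    private final Map<String, Set<String>> blacklistByApp = new HashMap<>();

    AmBlacklistSketch(IntPredicate condition) {
        this.condition = condition;
    }

    /** Records the exit of an AM container and blacklists the node for that app if the condition holds. */
    void recordAmExit(String appId, String node, int exitStatus) {
        if (condition.test(exitStatus)) {
            blacklistByApp.computeIfAbsent(appId, k -> new HashSet<>()).add(node);
        }
    }

    /** Returns true if the node should be avoided for this app's next AM attempt. */
    boolean isBlacklisted(String appId, String node) {
        return blacklistByApp.getOrDefault(appId, Collections.emptySet()).contains(node);
    }
}
```

With `narrowCondition`, an AM attempt that exits with INVALID leaves the node eligible for the next attempt. With `anyFailureCondition`, the same exit blacklists the node, but only for the app that failed there.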