Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Wed, 21 Oct 2015 05:13:27 +0000 (UTC)
From: "Sangjin Lee (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12906432.1445401349000.14201.1445404407924@Atlassian.JIRA>
In-Reply-To: <JIRA.12906432.1445401349000@Atlassian.JIRA>
References: <JIRA.12906432.1445401349000@Atlassian.JIRA>
 <JIRA.12906432.1445401349820@arcas>
Subject: [jira] [Commented] (YARN-4284) condition for AM blacklisting is too
 narrow
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966248#comment-14966248 ] 

Sangjin Lee commented on YARN-4284:
-----------------------------------

Hi [~sunilg], thanks for the comment. Yes, I've been following the discussion on YARN-2005 as well as YARN-2293. Although it would be nice to have a reliable scoring mechanism as a basis for assigning AM containers, what's implemented in YARN-2005 is actually a pretty solid solution to this problem. By the way, this is one of the more common issues our users encounter.

The only problem with YARN-2005 is that the blacklisting condition is too narrow. In fact, we rarely encounter the DISKS_FAILED error. It's usually more like INVALID (-1000) or other errors. We could try to be really precise and blacklist a node only when the container exit status is attributable to the node itself rather than to the app, but maintaining such a precise condition may prove brittle.
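
As a rough illustration of the difference between the two conditions, here is a minimal Java sketch. The class and method names are hypothetical and this is not the YARN-2005 code. The constants are assumed to mirror YARN's ContainerExitStatus values: INVALID is -1000 as noted above, while -101 for DISKS_FAILED and 0 for success are assumptions.

```java
// Hypothetical sketch; not the actual ResourceManager code.
final class AmBlacklistCondition {
    // Assumed to mirror YARN's ContainerExitStatus constants.
    public static final int SUCCESS = 0;          // assumption
    public static final int DISKS_FAILED = -101;  // assumption
    public static final int INVALID = -1000;      // value quoted in the comment

    // Narrow condition: only a disk failure blacklists the node.
    public static boolean narrow(int exitStatus) {
        return exitStatus == DISKS_FAILED;
    }

    // Broad condition: any unsuccessful exit blacklists the node.
    public static boolean broad(int exitStatus) {
        return exitStatus != SUCCESS;
    }

    public static void main(String[] args) {
        int[] statuses = {SUCCESS, DISKS_FAILED, INVALID, 1};
        for (int s : statuses) {
            System.out.println(s + " narrow=" + narrow(s) + " broad=" + broad(s));
        }
    }
}
```

Under the narrow condition an AM that exits with INVALID would not trigger blacklisting, which is the case described above; under the broad one it would.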

IMO the key is that the blacklisting implemented in YARN-2005 is *per-app*. As such, we can afford to be more aggressive instead of trying to come up with a 100% accurate blacklisting condition. Since it is per-app, there is no risk that one bad app can cause a node to be blacklisted for all other apps (correct me if I'm wrong). Thoughts? Do you see any other risks in taking this approach?
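
To show why per-app scoping limits the damage, here is a small Java sketch, assuming the blacklist is simply a map from application to the nodes its next AM attempt should avoid. The class, method names, and string IDs are all hypothetical and are not taken from YARN-2005.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of per-app scoping; names are illustrative only.
final class PerAppAmBlacklist {
    // appId -> nodes that this app's next AM attempt should avoid
    private final Map<String, Set<String>> nodesByApp = new HashMap<>();

    // Record that an AM attempt of appId failed on the given node.
    public void recordAmFailure(String appId, String node) {
        nodesByApp.computeIfAbsent(appId, k -> new HashSet<>()).add(node);
    }

    // A node is avoided only for the app that failed there.
    public boolean isBlacklisted(String appId, String node) {
        return nodesByApp.getOrDefault(appId, Collections.<String>emptySet())
                         .contains(node);
    }

    public static void main(String[] args) {
        PerAppAmBlacklist blacklist = new PerAppAmBlacklist();
        blacklist.recordAmFailure("app_1", "node-a");
        System.out.println("app_1 avoids node-a: " + blacklist.isBlacklisted("app_1", "node-a"));
        System.out.println("app_2 avoids node-a: " + blacklist.isBlacklisted("app_2", "node-a"));
    }
}
```

In this sketch a failure of app_1 on node-a changes nothing for app_2, so an overly aggressive condition costs at most one app some placement choices.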

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the next app attempt can be assigned to a different node.
> However, the condition under which a node gets blacklisted is currently limited to {{DISKS_FAILED}}. There is a whole host of other issues that may cause the failure and for which we want to place the AM elsewhere, e.g. full disks, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in blacklisting nodes on *any failure* (although it might lead to blacklisting a node more aggressively than necessary). I propose placing the next app attempt on a different node after any failure.

