hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4284) condition for AM blacklisting is too narrow
Date Wed, 21 Oct 2015 05:13:27 GMT

    [ https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14966248#comment-14966248 ]

Sangjin Lee commented on YARN-4284:

Hi [~sunilg], thanks for the comment. Yes, I've been following the discussion on YARN-2005
as well as YARN-2293. Although it would be nice to have a reliable scoring mechanism as a
basis for assigning AM containers, what's implemented in YARN-2005 is actually a pretty solid
solution to this problem. By the way, this is one of the more common issues our users encounter.

The only problem with YARN-2005 is that the blacklisting condition is too narrow. In fact,
we rarely encounter the DISKS_FAILED error. It's usually more like INVALID (-1000) or other
errors. We can try to be really precise and blacklist nodes only if the container exit status
is purely due to the node itself and is not caused by the app. But maintaining that precise
condition may prove to be brittle.

IMO the key is that blacklisting implemented in YARN-2005 is *per-app*. As such, we can afford
to be more aggressive, instead of trying to come up with the 100% accurate blacklisting condition.
Since it is per-app, there is no risk one bad app can cause a node to be blacklisted for all
other apps (correct me if I'm wrong). Thoughts? Do you see any other risks in taking this approach?
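
To make the difference between the two conditions concrete, here is a minimal sketch. It is not the YARN-2005 code or the attached patch: the class, method names, and per-app map are all hypothetical, and the exit codes are local constants defined only for illustration (INVALID as -1000 is the value mentioned above; the value used for DISKS_FAILED is an assumption). It only shows why a broader condition stays contained when the blacklist is keyed by app.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names come from the YARN code base.
class AmBlacklistSketch {
    // Illustrative exit codes. INVALID (-1000) is quoted in the thread;
    // the DISKS_FAILED value is an assumption made for this example.
    static final int SUCCESS = 0;
    static final int INVALID = -1000;
    static final int DISKS_FAILED = -101;

    // Narrow condition: blacklist only on a disk failure.
    static boolean narrowCondition(int exitStatus) {
        return exitStatus == DISKS_FAILED;
    }

    // Broad condition: blacklist on any unsuccessful AM container exit.
    static boolean broadCondition(int exitStatus) {
        return exitStatus != SUCCESS;
    }

    // Blacklisted nodes are tracked per app, never globally.
    private final Map<String, Set<String>> blacklistByApp = new HashMap<>();

    // Record the outcome of an AM container for one app on one node.
    void onAmContainerFinished(String appId, String node, int exitStatus) {
        if (broadCondition(exitStatus)) {
            blacklistByApp.computeIfAbsent(appId, k -> new HashSet<>()).add(node);
        }
    }

    // A node is avoided only for the app that failed on it.
    boolean isBlacklisted(String appId, String node) {
        return blacklistByApp
            .getOrDefault(appId, Collections.<String>emptySet())
            .contains(node);
    }
}
```

With this shape, an AM failure with INVALID on one node makes only that app avoid the node for its next attempt; other apps still see the node as usable, which is the property the per-app argument relies on.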

> condition for AM blacklisting is too narrow
> -------------------------------------------
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the next app
> attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is limited to
> {{DISKS_FAILED}}. There are a whole host of other issues that may cause the failure, for which
> we want to locate the AM elsewhere; e.g. disks full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in blacklisting
> the nodes on *any failure* (although it might lead to blacklisting the node more aggressively
> than necessary). I would propose locating the next app attempt to a different node on any

