hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4284) condition for AM blacklisting is too narrow
Date Wed, 21 Oct 2015 16:13:28 GMT

    [ https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967377#comment-14967377 ]

Sangjin Lee commented on YARN-4284:
-----------------------------------

[~steve_l] [~sunilg], if you have one or two nodes and the AM container of an app fails, {{yarn.am.blacklisting.disable-failure-threshold}} ensures that the app cannot blacklist the entire cluster for itself. Once the blacklist grows past the threshold, it is cleared and all nodes become available again. Again, this is *per-app* behavior: other apps are not affected by this decision whatsoever.
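
To make the threshold concrete, here is a minimal sketch of the per-app check, assuming an illustrative {{AmBlacklistPolicy}} class (the names are hypothetical, not the actual RM implementation): once an app's blacklist covers at least the threshold fraction of the cluster, the blacklist is ignored for that app.

{code:java}
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the actual RM code: track per-app AM
// blacklisting and disable it past a configurable threshold.
public class AmBlacklistPolicy {
  // Fraction of the cluster, e.g. 0.8f, read from
  // yarn.am.blacklisting.disable-failure-threshold.
  private final float disableThreshold;
  private final Set<String> blacklistedNodes = new HashSet<>();

  public AmBlacklistPolicy(float disableThreshold) {
    this.disableThreshold = disableThreshold;
  }

  public void addNode(String nodeId) {
    blacklistedNodes.add(nodeId);
  }

  /** Blacklist to apply when placing this app's next AM attempt. */
  public Set<String> effectiveBlacklist(int clusterNodeCount) {
    // If this app has blacklisted at least the threshold fraction of
    // the cluster, ignore the blacklist so every node stays usable.
    if (blacklistedNodes.size() >= disableThreshold * clusterNodeCount) {
      return Collections.emptySet();
    }
    return Collections.unmodifiableSet(blacklistedNodes);
  }
}
{code}

With two nodes and a threshold of 0.8, for example, blacklisting the second node (2 >= 1.6) immediately clears the blacklist, so the app is never left with zero schedulable nodes.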

As for the condition for applying blacklisting, I think we can add {{PREEMPTED}} to the list of exit statuses that do *not* trigger blacklisting. I'm not so sure about {{KILLED_BY_RESOURCEMANAGER}}. An AM container can be killed by the resource manager due to a node issue: any failure to heartbeat properly will cause the RM to kill the AM container, and that heartbeat failure can have many causes. Just because the RM killed it doesn't definitively mean it was purely an app problem. What do you think?
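
For illustration, the exemption check might look like the following sketch: a hypothetical helper (not the actual RM code) over the real {{ContainerExitStatus}} constants, with the debatable case left commented out.

{code:java}
import org.apache.hadoop.yarn.api.records.ContainerExitStatus;

// Hypothetical helper: should this AM container exit status
// blacklist the node for this app's next attempt?
static boolean shouldBlacklistNode(int exitStatus) {
  switch (exitStatus) {
    case ContainerExitStatus.SUCCESS:
    case ContainerExitStatus.PREEMPTED: // proposed addition
    // case ContainerExitStatus.KILLED_BY_RESOURCEMANAGER:
    //   Debatable: an RM kill can stem from a node-side heartbeat
    //   problem, so exempting it could hide a genuinely bad node.
      return false;
    default:
      // Everything else counts against the node for this app.
      return true;
  }
}
{code}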

I think we may want to approach this from the point of view of *anti-affinity*. Currently there is an inherent *affinity* to nodes when it comes to assigning AM containers. In my view, anti-affinity is the better default behavior. Even in the worst-case scenario, where the AM container failures were caused purely by the app, running subsequent attempts on different nodes makes it clear that the failures were unrelated to the nodes, which helps troubleshooting a great deal. Today, when all AM containers land on the same node, we sometimes spend a fair amount of time convincing our users that the failure had nothing to do with that node.
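
As a sketch of the anti-affinity idea (a hypothetical placement helper, not the scheduler's real code), the next AM attempt would simply prefer any node that has not hosted one of the app's earlier attempts:

{code:java}
import java.util.List;
import java.util.Set;

// Hypothetical anti-affinity pick for the next AM attempt.
static String pickAmNode(List<String> candidates, Set<String> priorAmNodes) {
  for (String node : candidates) {
    if (!priorAmNodes.contains(node)) {
      return node; // first node with no prior AM attempt for this app
    }
  }
  // Every candidate already hosted an attempt; fall back to any node.
  return candidates.isEmpty() ? null : candidates.get(0);
}
{code}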

Thoughts and comments are welcome. Thanks!

> condition for AM blacklisting is too narrow
> -------------------------------------------
>
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
>
>
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the next app attempt can be assigned to a different node.
> However, currently the condition under which a node gets blacklisted is limited to {{DISKS_FAILED}}. There is a whole host of other issues that may cause the failure, for which we would want to locate the AM elsewhere; e.g. disks full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in blacklisting the nodes on *any failure* (although it might lead to blacklisting a node more aggressively than necessary). I would propose locating the next app attempt on a different node on any failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
