hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4284) condition for AM blacklisting is too narrow
Date Wed, 21 Oct 2015 16:53:27 GMT

    [ https://issues.apache.org/jira/browse/YARN-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967434#comment-14967434 ]

Sunil G commented on YARN-4284:

Yes [~sjlee0]. Once the threshold is reached, it clears all nodes from the blacklist. Thank you for the correction.

bq.Just because it was killed by the RM doesn't mean definitively that it was purely an app
I think so, yes. It may not be app-specific.

bq.anti-affinity is a better behavior as a default behavior. In the worst case scenario when
the AM container failure was caused purely by the app, running subsequent attempts on different
nodes will only make it clear that the failures were unrelated to nodes
Yes, I agree with your point. It can help isolate whether the container failure was caused by the app or by the node. So we
could skip only {{PREEMPTED}} for now and treat all other failure cases as candidates for blacklisting.
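The proposed rule could be sketched roughly as follows. This is a hypothetical illustration, not the actual resourcemanager patch: {{AmBlacklistPolicy}} and {{shouldBlacklistNode}} are made-up names, and the exit-status constants are simplified local stand-ins whose values mirror {{org.apache.hadoop.yarn.api.records.ContainerExitStatus}}.

```java
// Hypothetical sketch of the proposed condition: blacklist the node on any
// AM container failure EXCEPT preemption, since preemption is a scheduler
// decision and says nothing about the health of the node.
public class AmBlacklistPolicy {

    // Simplified stand-ins for the real ContainerExitStatus constants.
    static final int DISKS_FAILED = -101;
    static final int PREEMPTED = -102;

    // Proposed rule from the discussion above: skip only PREEMPTED for now;
    // every other failure makes the node a blacklisting candidate.
    static boolean shouldBlacklistNode(int exitStatus) {
        return exitStatus != PREEMPTED;
    }

    public static void main(String[] args) {
        System.out.println(shouldBlacklistNode(DISKS_FAILED)); // node problem -> true
        System.out.println(shouldBlacklistNode(PREEMPTED));    // scheduler action -> false
    }
}
```

Because AM blacklisting is per-app, the cost of this broader condition is bounded: at worst an individual app avoids a healthy node, without affecting other applications.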

> condition for AM blacklisting is too narrow
> -------------------------------------------
>                 Key: YARN-4284
>                 URL: https://issues.apache.org/jira/browse/YARN-4284
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4284.001.patch
> Per YARN-2005, there is now a way to blacklist nodes for AM purposes so the next app
attempt can be assigned to a different node.
> However, currently the condition under which the node gets blacklisted is limited to
{{DISKS_FAILED}}. There are a whole host of other issues that may cause the failure, for which
we want to locate the AM elsewhere; e.g. disks full, JVM crashes, memory issues, etc.
> Since the AM blacklisting is per-app, there is little practical downside in blacklisting
the nodes on *any failure* (although it might lead to blacklisting the node more aggressively
than necessary). I would propose locating the next app attempt to a different node on any

This message was sent by Atlassian JIRA
