hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4837) User facing aspects of 'AM blacklisting' feature need fixing
Date Fri, 18 Mar 2016 15:41:33 GMT

    [ https://issues.apache.org/jira/browse/YARN-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15201655#comment-15201655 ]

Sangjin Lee commented on YARN-4837:
-----------------------------------

I just wanted to add my 2 cents to the discussion, specifically about YARN-4284, where we broadened
the set of causes for blacklisting a node for AM placement.

AMs repeatedly getting assigned to the same node in spite of failures is one of the most frequent
complaints from our users ("why did our AMs keep landing on that bad node, causing our jobs
to fail?"). The worst case is a node with a "soft" failure that doesn't quite tip it over into
an unhealthy state. Since the node is still reported as healthy and appears to have a lot of
available capacity, the chance that it gets the next attempt too is quite high; i.e. we
effectively have node affinity. And since this is an AM, the consequence is much more severe
than when an ordinary container lands on that node.

The causes of these soft failures vary, and coming up with a precise set of exit codes that
identifies them isn't straightforward. There are even error codes like INVALID which we see
quite often (see [my previous comment|https://issues.apache.org/jira/browse/YARN-4284?focusedCommentId=14966248&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14966248]).
I know going broad means a node could be blacklisted for the app for reasons that have nothing
to do with the node, such as an error in the app's configuration (false positives). However, the
reason we can afford to go broad is that this blacklisting is *per-app*. The only downside of a
false positive is that the app's next AM attempt gets assigned to another node.
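
To illustrate the per-app scoping described above, here is a minimal toy sketch in Python. It is not the ResourceManager's code; the class, method, app, and node names are all hypothetical. It only shows that a failure recorded for one app narrows that app's candidate nodes and leaves other apps unaffected.

```python
class PerAppBlacklist:
    """Toy model (hypothetical): each app keeps its own set of nodes to avoid for AM placement."""

    def __init__(self):
        # Maps an app id to the set of nodes blacklisted for that app only.
        self._blacklisted = {}

    def record_am_failure(self, app_id, node):
        # Broad policy: any AM failure on a node blacklists it for this app.
        self._blacklisted.setdefault(app_id, set()).add(node)

    def candidates(self, app_id, nodes):
        # Nodes still eligible to host this app's next AM attempt.
        avoided = self._blacklisted.get(app_id, set())
        return [n for n in nodes if n not in avoided]


nodes = ["node1", "node2", "node3"]
blacklist = PerAppBlacklist()
blacklist.record_am_failure("app_A", "node2")

candidates_a = blacklist.candidates("app_A", nodes)  # app_A avoids node2
candidates_b = blacklist.candidates("app_B", nodes)  # app_B is unaffected
```

Even if the failure on node2 was a false positive, the cost in this model is limited to app_A's next attempt being placed elsewhere.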

We have a number of large busy clusters, and we're using this with success and with little
downside.

That said, I do recognize that this could be a problem if {{yarn.resourcemanager.am.max-attempts}}
is larger than the number of nodes in the cluster, since an app could then end up blacklisting
every node before it runs out of attempts.
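
The concern about max-attempts exceeding the cluster size can be made concrete with a small toy simulation in Python. It assumes the worst case where every AM attempt fails for an app-side reason and each failure blacklists the node it ran on. The function name and the "pick the first eligible node" placement rule are assumptions for illustration, not how the scheduler actually chooses nodes.

```python
def simulate_am_attempts(nodes, max_attempts):
    """Toy model (hypothetical): every attempt fails and blacklists its node.

    Returns (placements, stuck_at), where stuck_at is the attempt number
    that found no eligible node, or None if all attempts could be placed.
    """
    blacklisted = set()
    placements = []
    for attempt in range(1, max_attempts + 1):
        eligible = [n for n in nodes if n not in blacklisted]
        if not eligible:
            # Every node is blacklisted for this app; the attempt cannot be placed.
            return placements, attempt
        node = eligible[0]  # assumed placement rule: first eligible node
        placements.append(node)
        blacklisted.add(node)  # assumed: the attempt fails due to an app-side error
    return placements, None


cluster = ["node1", "node2", "node3"]
small = simulate_am_attempts(cluster, max_attempts=2)
large = simulate_am_attempts(cluster, max_attempts=5)
```

With three nodes, two attempts are placed without trouble, while a limit of five leaves the fourth attempt with no eligible node in this model.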

> User facing aspects of 'AM blacklisting' feature need fixing
> ------------------------------------------------------------
>
>                 Key: YARN-4837
>                 URL: https://issues.apache.org/jira/browse/YARN-4837
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>
> Was reviewing the user-facing aspects that we are releasing as part of 2.8.0.
> Looking at the 'AM blacklisting feature', I see several things to be fixed before we
> release it in 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
