hadoop-yarn-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2005) Blacklisting support for scheduling AMs
Date Wed, 28 Jan 2015 14:22:37 GMT

    [ https://issues.apache.org/jira/browse/YARN-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295171#comment-14295171 ]

Jason Lowe commented on YARN-2005:

bq.  App name is the first point came in to my thoughts.

The problem with using the app name in the workflow-spamming case is that many workflows I've
seen use a different app name on each submission, since the name often includes a timestamp
indicating which data window it's consuming or producing.  If the workflow keeps retrying the
same failed apps then the app name may not change, but if it's plowing ahead and submitting
other jobs then the name very likely does change.

bq. If an app from "user1" with name "job2" fails on node1, it is very much appropriate to
try its second attempt in a different node.

Totally agree.  I think it's worthwhile to consider implementing relatively simple app-specific
blacklisting logic to avoid this fairly common scenario.  We can then follow that up with
a much more sophisticated blacklisting algorithm with fancy weighting, time decays, etc.,
but the biggest problem we're seeing probably doesn't need anything that fancy to solve 80%
of the cases we see.
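
To make the "relatively simple" option concrete, here is a rough sketch of what per-application tracking could look like. It is only an illustration under stated assumptions, not a proposed patch: the class AmNodeBlacklist and its methods are hypothetical names and are not existing YARN APIs, app and node IDs are plain strings, and the failure threshold is assumed to come from some configuration value. It keys on the application ID rather than the app name, so workflows that change the name on every submission are unaffected.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not a YARN class): per-app blacklist of nodes for AM launches. */
class AmNodeBlacklist {
    // Assumed to be configurable: failed AM attempts on a node before it is avoided.
    private final int maxFailuresPerNode;

    // appId -> (nodeId -> number of failed AM attempts of that app on that node)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    AmNodeBlacklist(int maxFailuresPerNode) {
        if (maxFailuresPerNode < 1) {
            throw new IllegalArgumentException("maxFailuresPerNode must be >= 1");
        }
        this.maxFailuresPerNode = maxFailuresPerNode;
    }

    /** Record that an AM attempt of the given app failed on the given node. */
    void recordAmFailure(String appId, String nodeId) {
        failures.computeIfAbsent(appId, k -> new HashMap<>())
                .merge(nodeId, 1, Integer::sum);
    }

    /** True if this app has reached the failure threshold on this node. */
    boolean isBlacklisted(String appId, String nodeId) {
        Map<String, Integer> perNode =
                failures.getOrDefault(appId, Collections.emptyMap());
        return perNode.getOrDefault(nodeId, 0) >= maxFailuresPerNode;
    }

    /** Return the first candidate node this app has not blacklisted, if any. */
    Optional<String> pickNode(String appId, List<String> candidateNodes) {
        return candidateNodes.stream()
                .filter(node -> !isBlacklisted(appId, node))
                .findFirst();
    }

    /** Drop the app's state once it finishes so the map does not grow forever. */
    void appFinished(String appId) {
        failures.remove(appId);
    }
}
```

With a threshold of 1, a failed first attempt on node1 means the second attempt of the same app is steered to another node, while other apps can still use node1.  No weighting or time decay is involved; that would belong to the more sophisticated follow-up.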

bq. I feel i could jot down few points and share as a doc for same

Sounds good, feel free to post one.

> Blacklisting support for scheduling AMs
> ---------------------------------------
>                 Key: YARN-2005
>                 URL: https://issues.apache.org/jira/browse/YARN-2005
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 0.23.10, 2.4.0
>            Reporter: Jason Lowe
> It would be nice if the RM supported blacklisting a node for an AM launch after the same
> node fails a configurable number of AM attempts.  This would be similar to the blacklisting
> support for scheduling task attempts in the MapReduce AM but for scheduling AM attempts on
> the RM side.

This message was sent by Atlassian JIRA
