hadoop-yarn-issues mailing list archives

From "Craig Welch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2848) (FICA) Applications should maintain an application specific 'cluster' resource to calculate headroom and userlimit
Date Thu, 13 Nov 2014 03:48:33 GMT

    [ https://issues.apache.org/jira/browse/YARN-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14209214#comment-14209214 ]

Craig Welch commented on YARN-2848:

bq. Thanks for your explanation, I think it is valid to have such mechanism of course , I
just concerned about the cost.

It sounds like you're under the impression that this is somehow optional/elective - I don't
believe it is.  Until we implement something along these lines we have known defects ([YARN-1680]).
One way or another, some capability like this needs to be created, or we need to remove
other functionality (headroom, userlimits), or continue to have significant defects/shortcomings
(which is problematic, and imho not really an option).

bq. The pull model you mentioned is isomorphic as the push model (send events to apps, which
we can also add filters to select which apps to send). And wrt pull model, we don't have dedicated
thread for app to do that. And more problematic, if we cannot get apps synchronously handle
such events, we need prepare a event queue for apps to do that.

Not at all - as I've mentioned a couple of times, one option is simply to attach an update
indicator to resources, which the app can compare against its own to determine whether any
action needs to be taken, with the general case expected to be none.  That's where the efficiency
of the approach comes in.  Of course, the particulars of the implementation are what we need
to work out here, but we do not necessarily have to have event queues, and we certainly don't
need to have the apps synchronously handle events.  It's possible to take those approaches,
but certainly not necessary.
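
To make the update-indicator idea concrete, here is a minimal sketch.  It is not taken from
any patch on this jira: the class names (ClusterResourceView, AppResourceCache), the use of a
simple version counter as the indicator, and the reduction of a resource to a single memory
value are all assumptions made for illustration.  The app compares the indicator it last saw
against the current one and only recalculates when they differ; no event queue is involved.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical cluster-side holder: bumps a version whenever nodes or labels change.
final class ClusterResourceView {
    private final AtomicLong version = new AtomicLong();
    private volatile long totalMemoryMb;

    // Called on node addition/removal or label change (rare compared to allocations).
    void update(long newTotalMemoryMb) {
        totalMemoryMb = newTotalMemoryMb; // publish the value first...
        version.incrementAndGet();        // ...then advance the indicator
    }

    long version() { return version.get(); }
    long totalMemoryMb() { return totalMemoryMb; }
}

// Hypothetical per-application cache, consulted on each allocation heartbeat.
final class AppResourceCache {
    private long seenVersion = -1; // indicator value at the last recalculation
    private long cachedMemoryMb;
    int recalculations = 0;        // exposed only so the behaviour can be observed

    long effectiveMemoryMb(ClusterResourceView cluster) {
        long current = cluster.version();
        if (current != seenVersion) {
            // Assumption: real code would apply blacklist / label filtering here.
            cachedMemoryMb = cluster.totalMemoryMb();
            seenVersion = current;
            recalculations++;
        }
        return cachedMemoryMb; // general case: indicator unchanged, nothing to do
    }
}
```

In this sketch the common path is a single long comparison per heartbeat, and the
recalculation cost is only paid after the cluster side has actually changed something.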

bq. And I think the statement is not always true ... Since it is possible we change labels
on a set of nodes (say 1k nodes), and many applications could run across the 1k nodes, some
operation will scan nodes and build information from scratch, it is a O ( n * m ) operation
in very extreme cases.

If all running applications were interested in a label which changed across all nodes in a
cluster, some activity would be necessary for them to make adjustments.  As a rule, this will
be very infrequent in comparison to the frequency of allocation requests in the cluster, which
is the strength of the approach.  Depending on how exactly we model things, it may well not
be necessary for all applications to process all nodes of the cluster individually.  For example,
if we limit nodes to a single label per node then that could be calculated at a cluster level.
 If not, tracking intersection values for label combinations (if limited) could eliminate
the need.  
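
As an illustration of the single-label-per-node shortcut, the sketch below keeps one running
total per label at the cluster level, so an application interested in a label reads a single
value rather than scanning nodes.  The LabelTotals class, its method names, and the use of
memory in MB as the only resource dimension are hypothetical and chosen just for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical cluster-level aggregate, valid only if each node carries exactly one label.
final class LabelTotals {
    private final Map<String, Long> memoryByLabel = new HashMap<>();

    void nodeAdded(String label, long memoryMb) {
        memoryByLabel.merge(label, memoryMb, Long::sum);
    }

    void nodeRemoved(String label, long memoryMb) {
        memoryByLabel.merge(label, -memoryMb, Long::sum);
    }

    // A label change moves the node's capacity from one total to the other.
    void nodeRelabeled(String oldLabel, String newLabel, long memoryMb) {
        nodeRemoved(oldLabel, memoryMb);
        nodeAdded(newLabel, memoryMb);
    }

    // Constant-time lookup for an application interested in this label.
    long memoryFor(String label) {
        return memoryByLabel.getOrDefault(label, 0L);
    }
}
```

With this shape, relabeling 1k nodes costs 1k map updates in total, independent of how many
applications are running.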

Putting aside possible shortcuts for a moment, however, I suspect the straightforward approach
of recalculation only when necessary at an application level will actually be fine - it's
possible to posit pathological cases which will be problematic there, but it's possible to
do that with many things.  If the pathological case (a change to labels of interest or nodes
to every application at every allocation heartbeat, or a change to the set of cluster nodes
on every heartbeat...) is not likely and does not need to be supported (it isn't and doesn't...),
then infrequent recalculations only when necessary should not be problematic.  The original
approach on [YARN-1680] would have performed that calculation with every allocation request
- which we rightly took issue with - but doing so only when needed is considered to be a viable
approach (the only realistic one I'm aware of...), which is why we're heading in that direction
- the question is how to do that in detail.  The point of this jira is to note that the blacklist
problem and the node label problem in relation to resources available to the application are
strikingly similar in their needs (they're photo-negatives of one another, effectively...),
and so it makes sense to combine them as it is likely that sharing would build both runtime
and code efficiency.

> (FICA) Applications should maintain an application specific 'cluster' resource to calculate
headroom and userlimit
> ------------------------------------------------------------------------------------------------------------------
>                 Key: YARN-2848
>                 URL: https://issues.apache.org/jira/browse/YARN-2848
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: capacityscheduler
>            Reporter: Craig Welch
>            Assignee: Craig Welch
> Likely solutions to [YARN-1680] (properly handling node and rack blacklisting with cluster
level node additions and removals) will entail managing an application-level "slice" of the
cluster resource available to the application for use in accurately calculating the application
headroom and user limit.  There is an assumption that events which impact this resource will
occur less frequently than the need to calculate headroom, userlimit, etc (which is a valid
assumption given that the calculation occurs on every allocation heartbeat).  Given that, the application should
(with assistance from cluster-level code...) detect changes to the composition of the cluster
(node addition, removal) and when those have occurred, calculate an application specific cluster
resource by comparing cluster nodes to its own blacklist (both rack and individual node).
 I think it makes sense to include nodelabel considerations into this calculation as it will
be efficient to do both at the same time and the single resource value reflecting both constraints
could then be used for efficient frequent headroom and userlimit calculations while remaining
highly accurate.  The application would need to be made aware of nodelabel changes it is interested
in (the application or removal of labels of interest to the application to/from nodes).  For
this purpose, the application submission's nodelabel expression would be used to determine
the nodelabel impact on the resource used to calculate userlimit and headroom (Cases where
the application elected to request resources not using the application level label expression
are out of scope for this - but for the common usecase of an application which uses a particular
expression throughout, userlimit and headroom would be accurate).  This could also provide an
overall mechanism for handling application-specific resource constraints which might be added
in the future.
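
A rough sketch of the calculation described above, under simplifying assumptions: the
NodeInfo and AppClusterResource names are hypothetical, the resource is reduced to memory in
MB, and the nodelabel expression is reduced to a single required label (or null for none).
It only shows the shape of the one-pass computation over cluster nodes, combining the rack
and node blacklists with the label constraint; it is not the capacity scheduler's code.

```java
import java.util.List;
import java.util.Set;

// Hypothetical, simplified view of a cluster node.
final class NodeInfo {
    final String host;
    final String rack;
    final Set<String> labels;
    final long memoryMb;

    NodeInfo(String host, String rack, Set<String> labels, long memoryMb) {
        this.host = host;
        this.rack = rack;
        this.labels = labels;
        this.memoryMb = memoryMb;
    }
}

final class AppClusterResource {
    // Sums the capacity of nodes the application can actually use.
    // Intended to run only when nodes, labels, or the blacklist change,
    // not on every allocation heartbeat.
    static long compute(List<NodeInfo> nodes,
                        Set<String> blacklistedHosts,
                        Set<String> blacklistedRacks,
                        String requiredLabel) {
        long total = 0;
        for (NodeInfo n : nodes) {
            if (blacklistedHosts.contains(n.host) || blacklistedRacks.contains(n.rack)) {
                continue; // excluded by the application's own blacklist
            }
            if (requiredLabel != null && !n.labels.contains(requiredLabel)) {
                continue; // excluded by the (simplified) label expression
            }
            total += n.memoryMb;
        }
        return total;
    }
}
```

The resulting single value would then stand in for the cluster resource in the frequent
headroom and userlimit calculations.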
