hadoop-yarn-issues mailing list archives

From "Konstantinos Karanasos (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6344) Rethinking OFF_SWITCH locality in CapacityScheduler
Date Thu, 23 Mar 2017 02:19:41 GMT

    [ https://issues.apache.org/jira/browse/YARN-6344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15937580#comment-15937580 ]

Konstantinos Karanasos commented on YARN-6344:
----------------------------------------------

As I mentioned, in the patch I uploaded, if the new parameter ({{rack-locality-delay}}) is set to -1, the existing relax-locality behavior is preserved.
The new functionality kicks in only for positive values of the parameter.
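
For concreteness, here is a minimal sketch of that gating, assuming a hypothetical {{rackLocalityDelay}} value read from the scheduler configuration; the names are illustrative and may not match the ones in the patch:
{code:java}
// Illustrative sketch only; names do not necessarily match YARN-6344.001.patch.
final class OffSwitchGating {
  static boolean shouldRelaxToOffSwitch(long missedOpportunities,
      float legacyThreshold, int rackLocalityDelay) {
    if (rackLocalityDelay < 0) {
      // rack-locality-delay = -1 (the default): keep the existing
      // localityWaitFactor-based relax-locality behavior.
      return missedOpportunities > legacyThreshold;
    }
    // Positive values enable the new delay-based behavior.
    return missedOpportunities > rackLocalityDelay;
  }
}
{code}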

While we are at it, it might be useful to discuss whether the existing behavior is still desirable.
Consider a cluster of N nodes, and a resource request asking for C containers on L different locations (where L is the number of unique nodes and racks in the request).
Currently, rack assignment happens after node-locality-delay missed opportunities; this part is straightforward.
Off-switch assignment, on the other hand, happens after L * C / N missed opportunities, capped by the size of the cluster.
This means that we tend to allow off-switch assignments faster when: (a) the resource request targets few locations (small L), (b) few containers are requested (small C), (c) the cluster is large (big N), or any combination thereof.
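
In (simplified) code, the two thresholds look roughly as follows; this is a sketch of the logic described above, not the exact scheduler code:
{code:java}
// Simplified sketch of the current CapacityScheduler relaxation thresholds.
final class CurrentLocalityDelays {
  // Node -> rack: wait a fixed number of missed opportunities.
  static boolean canRelaxToRack(long missedOpportunities, int nodeLocalityDelay) {
    return missedOpportunities >= nodeLocalityDelay;
  }

  // Rack -> off-switch: wait L * C / N missed opportunities, capped at N, where
  // L = unique locations (nodes + racks) in the request, C = requested containers,
  // N = cluster size.
  static boolean canRelaxToOffSwitch(long missedOpportunities, int uniqueLocations,
      int requestedContainers, int clusterNodes) {
    float waitFactor = Math.min((float) uniqueLocations / clusterNodes, 1.0f);
    float threshold = Math.min(clusterNodes, requestedContainers * waitFactor);
    return missedOpportunities > threshold;
  }
}
{code}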

This seems to work well for apps requesting a big number of containers on a relatively small cluster. Let's look at some examples (the numbers are worked through in the sketch below):
* On a 100-node cluster, requesting 100+ containers, off-switch assignment is dictated by the size of the cluster (the threshold is capped at N). This should be a typical MR application on a common cluster.
* On a 100-node cluster, requesting 5 containers on 2 nodes of a single rack (L = 3) leads to off-switch assignment after 3 * 5 / 100 = 0.15 missed opportunities, i.e., after a single missed opportunity. This seems too pessimistic.
* On a 2000-node cluster, for any combination with L * C < 2000 (which should be the case more often than not), off-switch assignment happens after a single missed opportunity.
Note that most of our applications fall in the third category.
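
A small stand-alone example that evaluates the same simplified off-switch threshold for the three scenarios; the figures illustrate the reasoning and are not output of the real scheduler:
{code:java}
public class OffSwitchThresholdExamples {
  // Same simplified threshold as above: min(N, C * L / N).
  static float offSwitchThreshold(int uniqueLocations, int containers, int clusterNodes) {
    float waitFactor = Math.min((float) uniqueLocations / clusterNodes, 1.0f);
    return Math.min(clusterNodes, containers * waitFactor);
  }

  public static void main(String[] args) {
    // (1) 100-node cluster, 100+ containers on many locations: capped at N = 100.
    System.out.println(offSwitchThreshold(150, 100, 100));   // 100.0
    // (2) 100-node cluster, 5 containers on 2 nodes + 1 rack: ~0.15, so a single
    //     missed opportunity already triggers off-switch assignment.
    System.out.println(offSwitchThreshold(3, 5, 100));       // ~0.15
    // (3) 2000-node cluster with L * C < 2000: threshold below 1, so again a
    //     single missed opportunity is enough.
    System.out.println(offSwitchThreshold(20, 10, 2000));    // ~0.1
  }
}
{code}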

So it seems that the L * C / N load factor is not only too pessimistic, but it also effectively prevents rack assignments, since it kicks in too fast: if off-switch assignment kicks in after a single missed opportunity, we essentially invalidate rack assignments.
One possible way to mitigate this problem could be to multiply this load factor by the node-locality-delay when it comes to rack assignments, and by the rack-locality-delay when it comes to off-switch assignments.
This way we also "relax" the node-locality-delay, increasing the probability of a rack assignment, while making sure that off-switch relaxation does not kick in too soon.
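
A sketch of what that could look like (illustrative only, not what the current patch does); with the load factor of the second example (0.15) and, assuming the default node-locality-delay of 40, rack relaxation would wait 0.15 * 40 = 6 missed opportunities, and off-switch would wait 0.15 * rack-locality-delay:
{code:java}
// Sketch of the mitigation discussed above (illustrative, not the patch):
// scale both delays by the same load factor (L * C / N), so that rack
// relaxation can still fire before off-switch relaxation, even for small requests.
final class ScaledLocalityDelays {
  static boolean canRelaxToRack(long missedOpportunities, float loadFactor,
      int nodeLocalityDelay) {
    return missedOpportunities > loadFactor * nodeLocalityDelay;
  }

  static boolean canRelaxToOffSwitch(long missedOpportunities, float loadFactor,
      int rackLocalityDelay) {
    return missedOpportunities > loadFactor * rackLocalityDelay;
  }
}
{code}
Assuming rack-locality-delay is configured larger than node-locality-delay, the rack threshold stays below the off-switch one, so rack assignments remain possible before we go off-switch.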

But given that this might affect the behavior of existing applications, I would like to hear your opinion before making such a change.

> Rethinking OFF_SWITCH locality in CapacityScheduler
> ---------------------------------------------------
>
>                 Key: YARN-6344
>                 URL: https://issues.apache.org/jira/browse/YARN-6344
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Konstantinos Karanasos
>            Assignee: Konstantinos Karanasos
>         Attachments: YARN-6344.001.patch
>
>
> When relaxing locality from node to rack, the {{node-locality-delay}} parameter is used: when the scheduling opportunities for a scheduler key exceed the value of this parameter, we relax locality and try to assign the container to a node in the corresponding rack.
> On the other hand, when relaxing locality to off-switch (i.e., assigning the container anywhere in the cluster), we use a {{localityWaitFactor}}, which is computed based on the number of outstanding requests for a specific scheduler key, divided by the size of the cluster.
> In the case of applications that request containers in big batches (e.g., traditional MR jobs), and for relatively small clusters, the localityWaitFactor does not affect relaxing locality much.
> However, in the case of applications that request containers in small batches, this load factor takes a very small value, which leads to assigning off-switch containers too soon. This situation is even more pronounced in big clusters.
> For example, if an application requests only one container per request, locality will be relaxed after a single missed scheduling opportunity.
> The purpose of this JIRA is to rethink the way we relax locality for off-switch assignments.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


