From: "Arun Suresh (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Fri, 14 Jul 2017 15:50:00 +0000 (UTC)
Subject: [jira] [Comment Edited] (YARN-6808) Allow Schedulers to return OPPORTUNISTIC containers when queues go over configured capacity

    [ https://issues.apache.org/jira/browse/YARN-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087500#comment-16087500 ]

Arun Suresh edited comment on YARN-6808 at 7/14/17 3:49 PM:
------------------------------------------------------------

[~leftnoteasy], good questions.

bq. Use opportunistic container to do lazy preemption in NM. (Is there any umbrella JIRA for this?)

Technically, this is already the default behavior for opportunistic containers. Opportunistic containers are killed in the NM when a Guaranteed container is started by an AM, if the NM at that point does not have enough resources to start the Guaranteed container. We are also working on YARN-5972, which adds some amount of work preservation to this: instead of killing the opportunistic container, we PAUSE it. PAUSE will be supported using the cgroups [freezer|https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt] module on Linux and Job Objects on Windows (we are actually using this in production).
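
To make the NM-side mechanics concrete, here is a rough sketch of the decision the NM makes when it is asked to start a Guaranteed container. This is not the actual NodeManager/ContainerScheduler code; the class and member names below are made up purely for illustration, and the kill step is what YARN-5972 would replace with a pause.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Toy model of the NM-side behavior described above: when a GUARANTEED
 * container has to start and the node is out of resources, running
 * OPPORTUNISTIC containers are reclaimed (killed today, paused with
 * YARN-5972) until the guaranteed container fits.
 */
public class OppContainerReclaimer {

  enum ExecutionType { GUARANTEED, OPPORTUNISTIC }

  static final class Container {
    final String id;
    final ExecutionType type;
    final long memoryMb;
    Container(String id, ExecutionType type, long memoryMb) {
      this.id = id;
      this.type = type;
      this.memoryMb = memoryMb;
    }
  }

  private long freeMemoryMb;
  private final Deque<Container> runningOpportunistic = new ArrayDeque<>();

  OppContainerReclaimer(long freeMemoryMb) {
    this.freeMemoryMb = freeMemoryMb;
  }

  /** Book-keeping for a container that has actually started on this node. */
  void containerStarted(Container c) {
    freeMemoryMb -= c.memoryMb;
    if (c.type == ExecutionType.OPPORTUNISTIC) {
      runningOpportunistic.addLast(c);
    }
  }

  /** Called when an AM asks this NM to start a GUARANTEED container. */
  void startGuaranteed(Container guaranteed) {
    // Reclaim opportunistic containers until the guaranteed one fits.
    while (freeMemoryMb < guaranteed.memoryMb
        && !runningOpportunistic.isEmpty()) {
      Container victim = runningOpportunistic.pollLast();
      // Today this is a kill; with YARN-5972 it becomes a pause
      // (cgroups freezer on Linux, Job Objects on Windows).
      System.out.println("Reclaiming opportunistic container " + victim.id);
      freeMemoryMb += victim.memoryMb;
    }
    containerStarted(guaranteed);
  }
}
{code}

The real NM scheduler of course tracks vcores as well as memory and also has to deal with queued (not yet running) opportunistic containers; the point here is only that the kill/pause decision is entirely local to the NM.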
bq. Let's say app1 is in an underutilized queue and wants to preempt containers from an over-utilized queue. Will preemption happen if app1 asks for opportunistic containers?

I am assuming that by under-utilized you mean starved. Currently, if app1 SPECIFICALLY asks for opportunistic containers it will get them, irrespective of whether its queue is underutilized or not. Opportunistic container ALLOCATION is not limited by queue/cluster capacity today; it is only limited by the length of the container queue on each node (YARN-1011 will in time place stricter capacity limits, by allocating only when already-allocated resources are not actually being used). Opportunistic container EXECUTION is obviously bound by the available resources on the NM, and like I mentioned earlier, running opportunistic containers will be killed to make room for any Guaranteed container.

bq. For target #1, who makes the decision of moving guaranteed containers to opportunistic containers? If it is still decided by the central RM, does that mean the preemption logic in the RM is the same as today, except that the kill operation is decided on the NM side?

Yes, it is the RM. Currently, in both schedulers, after a container is allocated, candidates for preemption are chosen from containers of apps in queues which are above capacity; the RM then asks the NM to preempt those containers. What the latest patch (002) here does is this: allocation of containers happens in the same code path, but right before handing the container to the AM, it checks whether the queue capacity is exceeded; if so, the container is downgraded to opportunistic. Thus, technically, the same apps/containers that were targets for normal preemption become candidates for preemption at the NM. There are obvious improvements, like the phase 2 I mentioned in the JIRA description, where, in addition to downgrading over-capacity containers to opportunistic, we can upgrade running opportunistic containers to Guaranteed as an app's Guaranteed containers complete. Like I mentioned, we are still prototyping; we are running tests now to collect data and will keep you posted on the results.

bq. For overall opportunistic container execution: if the OC launch request is queued by the NM, it may wait a long time before it gets executed. In this case, do we need to modify AM code to: a. expect a longer delay before considering the launch failed; b. ask for more resources on different hosts, since there is no guaranteed launch time for an OC?

So, with YARN-4597 we introduced a container state called SCHEDULED. A container is in the SCHEDULED state while it is localizing or while it is sitting in the queue. Essentially, the extra delay looks just like localization delay to the AM. We have verified that this is fine for MapReduce and Spark.

bq. What happens if an app doesn't want opportunistic containers when it goes beyond its headroom? (Such as online services.) I think this should be a per-app config (give me OC when I go beyond headroom).

A per-app config makes sense. But the ResourceRequest already has a field of type {{ExecutionTypeRequest}}, which, in addition to the {{ExecutionType}}, also carries an {{enforceExecutionType}} flag. By default this is false, but if it is set to true, my latest patch ensures that only Guaranteed containers are returned. I have added a test case to verify that as well.
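
For reference, this is roughly what the opt-out looks like from the AM side: a minimal sketch assuming a Hadoop version that has the {{ExecutionTypeRequest}} records (2.9+/trunk); the exact record factory methods may differ slightly across versions.

{code:java}
import org.apache.hadoop.yarn.api.records.ExecutionType;
import org.apache.hadoop.yarn.api.records.ExecutionTypeRequest;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.util.Records;

public class GuaranteedOnlyRequest {

  /** Build a request that the scheduler is not allowed to downgrade. */
  public static ResourceRequest newGuaranteedOnlyRequest() {
    ResourceRequest req = Records.newRecord(ResourceRequest.class);
    req.setResourceName(ResourceRequest.ANY);
    req.setPriority(Priority.newInstance(1));
    req.setCapability(Resource.newInstance(2048, 2)); // 2 GB, 2 vcores
    req.setNumContainers(1);

    // enforceExecutionType = true: only GUARANTEED containers may be
    // returned for this request, even if the queue is over capacity.
    req.setExecutionTypeRequest(
        ExecutionTypeRequest.newInstance(ExecutionType.GUARANTEED, true));
    return req;
  }
}
{code}

The test case added in patch 002 exercises exactly this: a request with {{enforceExecutionType=true}} must never come back downgraded.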
bq. The existing patch makes a static decision, which happens when a new resource request is added by the AM. Should this be reconsidered when the app's headroom changes over time?

So, my latest patch (002) kind of addresses this. The decision is now made after container allocation. Also, I am now ignoring the headroom: a container is downgraded only if, at the time of allocation, the queue capacity is exceeded. The existing code paths ensure that the max-capacity of a queue is never exceeded anyway.
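
To spell out that allocation-time check, here is a simplified sketch of the downgrade decision described above. This is not code from the patch; the classes and names are invented for illustration only.

{code:java}
/**
 * Simplified illustration of the allocation-time check in patch 002:
 * right before an allocated container is handed back to the AM, downgrade
 * it to OPPORTUNISTIC if the app's queue is over its configured capacity
 * and the AM did not set enforceExecutionType.
 */
public class OverCapacityDowngrade {

  enum ExecutionType { GUARANTEED, OPPORTUNISTIC }

  static final class QueueUsage {
    float usedCapacity;       // e.g. 1.3 means 130% of configured capacity
    float configuredCapacity; // normalized, e.g. 1.0
    boolean isOverCapacity() {
      return usedCapacity > configuredCapacity;
    }
  }

  static final class AllocatedContainer {
    ExecutionType type = ExecutionType.GUARANTEED;
  }

  static AllocatedContainer maybeDowngrade(AllocatedContainer container,
      QueueUsage queue, boolean enforceExecutionType) {
    if (queue.isOverCapacity() && !enforceExecutionType) {
      container.type = ExecutionType.OPPORTUNISTIC;
    }
    return container;
  }
}
{code}

The phase-2 promotion path would be the mirror image: when some of an app's Guaranteed containers complete, one of its running opportunistic containers can be promoted to Guaranteed.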
> Allow Schedulers to return OPPORTUNISTIC containers when queues go over configured capacity
> --------------------------------------------------------------------------------------------
>
>                 Key: YARN-6808
>                 URL: https://issues.apache.org/jira/browse/YARN-6808
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Arun Suresh
>            Assignee: Arun Suresh
>         Attachments: YARN-6808.001.patch, YARN-6808.002.patch
>
>
> This is based on discussions with [~kasha] and [~kkaranasos].
> Currently, when a queue goes over capacity, apps on starved queues must wait either for containers to complete or for them to be preempted by the scheduler in order to get resources.
> This JIRA proposes to allow Schedulers to:
> # Allocate all containers over the configured queue capacity/weight as OPPORTUNISTIC.
> # Auto-promote running OPPORTUNISTIC containers of apps as and when their GUARANTEED containers complete.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)