hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3849) Too much of preemption activity causing continuous killing of containers across queues
Date Fri, 26 Jun 2015 15:04:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-3849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603005#comment-14603005
] 

Sunil G commented on YARN-3849:
-------------------------------

Thank you [~leftnoteasy] and [~kasha@cloudera.com]

[~kasha], we have tested this only in CS, and the issue looks like it is in DominantResourceCalculator.
I will analyze whether this can also happen in Fair.

[~leftnoteasy], I understand your point. Let me explain the scenario based on a few
key code snippets.
Please feel free to point out any issues in my analysis.

CSQueueUtils#updateUsedCapacity has the below code to calculate absoluteUsedCapacity:
{code}
absoluteUsedCapacity =
          Resources.divide(rc, totalPartitionResource, usedResource,
              totalPartitionResource); 
{code}

This results in a call to DominantResourceCalculator#divide:
{code}
  public float divide(Resource clusterResource,
      Resource numerator, Resource denominator) {
    return
        getResourceAsValue(clusterResource, numerator, true) /
        getResourceAsValue(clusterResource, denominator, true);
  }
{code}

In our cluster, the resource allocation is as follows:
usedResource = <10GB, 95 cores>
totalPartitionResource = <100GB, 100 cores>

Since we use dominance, absoluteUsedCapacity will come close to 1 even though memory is
only 10% used.
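
To make the arithmetic concrete, here is a small standalone sketch (not the actual DominantResourceCalculator code, just the same max-of-shares calculation applied to the numbers above):
{code}
// Standalone illustration of the dominant-share division, using
// usedResource = <10GB, 95 cores> and totalPartitionResource = <100GB, 100 cores>.
public class DominantShareExample {
  // Mirrors what getResourceAsValue(cluster, r, true) does: take the larger
  // of the memory share and the vcore share.
  static float dominantShare(float memGb, float cores,
      float clusterMemGb, float clusterCores) {
    return Math.max(memGb / clusterMemGb, cores / clusterCores);
  }

  public static void main(String[] args) {
    float used = dominantShare(10f, 95f, 100f, 100f);     // max(0.10, 0.95) = 0.95
    float total = dominantShare(100f, 100f, 100f, 100f);  // max(1.0, 1.0)   = 1.0
    System.out.println("absoluteUsedCapacity = " + (used / total)); // 0.95
  }
}
{code}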


In ProportionalCapacityPreemptionPolicy, it is used as below:
{code}
float absUsed = qc.getAbsoluteUsedCapacity(partitionToLookAt);
Resource current = Resources.multiply(partitionResource, absUsed);
{code} 

So *current - guaranteed* gives us toBePreempted, which will be close to <50GB, 45 cores>.
Actually the memory here should have been only 5GB.
Now in our cluster, each container is <1GB, 10 cores>.
Hence the *cores* portion will reach 0 after 5 container kills, but toBePreempted will still have memory left.
And as mentioned in the above comment, preemption will continue to kill further containers too.
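
A rough standalone sketch of that loop (illustrative arithmetic only, assuming the 0.5 guaranteed capacity from the issue description; not the real policy code):
{code}
// Illustrative only: current = absUsed * partitionResource,
// guaranteed = 0.5 * partitionResource, toBePreempted = current - guaranteed.
public class PreemptionLoopExample {
  public static void main(String[] args) {
    float clusterMemGb = 100f, clusterCores = 100f;
    float absUsed = 0.95f; // inflated dominant-share value from above

    // Roughly <45GB, 45 cores> here, even though only 10GB of memory is used.
    float toPreemptMemGb = Math.round(absUsed * clusterMemGb - 0.5f * clusterMemGb);
    float toPreemptCores = Math.round(absUsed * clusterCores - 0.5f * clusterCores);

    // QueueA's real usage <10GB, 95 cores> is roughly 9 containers of
    // <1GB, 10 cores> plus the AM; keep killing until both targets reach 0.
    int killed = 0;
    for (int c = 0; c < 9 && (toPreemptMemGb > 0 || toPreemptCores > 0); c++) {
      toPreemptMemGb -= 1f;
      toPreemptCores -= 10f;
      killed++;
      if (killed == 5) {
        System.out.println("after 5 kills the cores target is met, but "
            + toPreemptMemGb + "GB of (phantom) memory still drives preemption");
      }
    }
    System.out.println("containers killed: " + killed);
  }
}
{code}
With the inflated memory target, the policy keeps selecting containers long after the cores target is satisfied, which matches the observed behaviour of killing everything in QueueA except the AM.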

> Too much of preemption activity causing continuous killing of containers across queues
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-3849
>                 URL: https://issues.apache.org/jira/browse/YARN-3849
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.7.0
>            Reporter: Sunil G
>            Assignee: Sunil G
>            Priority: Critical
>
> Two queues are used. Each queue is given a capacity of 0.5. The Dominant Resource policy is used.
> 1. An app is submitted in QueueA and consumes the full cluster capacity.
> 2. After an app is submitted in QueueB, there is some demand, and preemption is invoked in QueueA.
> 3. Instead of killing only the excess over the 0.5 guaranteed capacity, we observed that all containers other than the AM are getting killed in QueueA.
> 4. Now the app in QueueB tries to take over the cluster with the current free space. But there is some updated demand from the app in QueueA, which lost its containers earlier, and preemption now kicks in on QueueB.
> The scenario in steps 3 and 4 keeps happening in a loop. Thus none of the apps complete.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
