Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Thu, 14 May 2015 23:19:02 +0000 (UTC)
From: "Wangda Tan (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12784835.1427122951000.121398.1431645542693@Atlassian.JIRA>
In-Reply-To: <JIRA.12784835.1427122951000@Atlassian.JIRA>
References: <JIRA.12784835.1427122951000@Atlassian.JIRA>
 <JIRA.12784835.1427122951884@arcas>
Subject: [jira] [Commented] (YARN-3388) Allocation in LeafQueue could get
 stuck because DRF calculator isn't well supported when computing user-limit
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-3388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544602#comment-14544602 ] 

Wangda Tan commented on YARN-3388:
----------------------------------

Thanks for updating, [~nroberts]. I took a look at the latest patch; some comments:
1) It may be better to rename rbl to partitionResource in a couple of places; rbl is not a very clear name to me.

2) One bigger problem is that updateClusterResource only considers NO_LABEL, but computeUserLimit uses getUsageRatio for all partitions. The ratio will be inaccurate for any other partition whose resource has been updated.
Solution could be:
a. Only use getUsageRatio when partition=NO_LABEL
b. Recompute the usage ratio for all partitions in updateClusterResource.

I prefer b, since the other code paths in your patch all take partitions into account. You can take a look at CSQueueUtils#updateQueueStatistics; it should have very similar logic for handling partitions when the cluster resource updates.
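
A minimal sketch of option b, to make the suggestion concrete. It is not taken from the patch: PartitionUsageSketch, Res, setUsed and the two-field resource model are hypothetical stand-ins for the real LeafQueue/Resource classes, and the dominant-share ratio is a simplified approximation of what a DRF calculator would return. NO_LABEL is assumed to be the empty string.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for per-partition usage tracking; not the YARN classes. */
class PartitionUsageSketch {
    /** Assumption: the default partition is identified by the empty string. */
    static final String NO_LABEL = "";

    /** Simplified two-dimensional resource: memory in MB and virtual cores. */
    static final class Res {
        final long memoryMb;
        final long vcores;

        Res(long memoryMb, long vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }
    }

    private final Map<String, Res> usedByPartition = new HashMap<>();
    private final Map<String, Double> usageRatioByPartition = new HashMap<>();

    void setUsed(String partition, Res used) {
        usedByPartition.put(partition, used);
    }

    /** Dominant share: the larger of the memory and vcore usage fractions. */
    static double dominantRatio(Res used, Res total) {
        double mem = total.memoryMb == 0 ? 0.0 : (double) used.memoryMb / total.memoryMb;
        double cpu = total.vcores == 0 ? 0.0 : (double) used.vcores / total.vcores;
        return Math.max(mem, cpu);
    }

    /** Option b: refresh the ratio of every tracked partition, not only NO_LABEL. */
    void updateClusterResource(Map<String, Res> totalByPartition) {
        for (Map.Entry<String, Res> e : usedByPartition.entrySet()) {
            Res total = totalByPartition.get(e.getKey());
            // A partition with no known total is treated as having a ratio of 0.
            double ratio = (total == null) ? 0.0 : dominantRatio(e.getValue(), total);
            usageRatioByPartition.put(e.getKey(), ratio);
        }
    }

    double getUsageRatio(String partition) {
        return usageRatioByPartition.getOrDefault(partition, 0.0);
    }
}
```

With this shape, a later getUsageRatio call for a labeled partition sees a value derived from that partition's current total rather than a stale one.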

3) It's better not to put the user-usage-ratio in ResourceUsage; ResourceUsage is meant to track common resources for user/app/queue. I suggest creating a ResourceUsage-like structure in LeafQueue that User and LeafQueue will share.

4) Better to split and rename User.updateUsageRatio into User.updateAndGetDeltaOfDominateResourceRatio and User.updateAndGetDominateResourceRatio; the "reset" parameter is not very straightforward to me.
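
A rough sketch of what the split in 4) might look like. The two method names come from the suggestion above; everything else is an assumption, including the class name UserRatioSketch, the primitive parameters, and the guess that the "reset" flag switches between returning the absolute ratio and returning the change since the last update.

```java
/** Hypothetical stand-in for LeafQueue.User; not the patch's implementation. */
class UserRatioSketch {
    // Last dominant-resource ratio computed for this user.
    private double dominantRatio = 0.0;

    private static double ratio(long usedMb, long usedVcores, long totalMb, long totalVcores) {
        double mem = totalMb == 0 ? 0.0 : (double) usedMb / totalMb;
        double cpu = totalVcores == 0 ? 0.0 : (double) usedVcores / totalVcores;
        return Math.max(mem, cpu);
    }

    /** Recomputes the stored ratio and returns its new absolute value. */
    double updateAndGetDominateResourceRatio(
            long usedMb, long usedVcores, long totalMb, long totalVcores) {
        dominantRatio = ratio(usedMb, usedVcores, totalMb, totalVcores);
        return dominantRatio;
    }

    /** Recomputes the stored ratio and returns the change from the previous value. */
    double updateAndGetDeltaOfDominateResourceRatio(
            long usedMb, long usedVcores, long totalMb, long totalVcores) {
        double previous = dominantRatio;
        dominantRatio = ratio(usedMb, usedVcores, totalMb, totalVcores);
        return dominantRatio - previous;
    }
}
```

Two explicitly named methods let the call site say which value it wants, instead of encoding that choice in a boolean argument.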


> Allocation in LeafQueue could get stuck because DRF calculator isn't well supported when computing user-limit
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3388
>                 URL: https://issues.apache.org/jira/browse/YARN-3388
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.6.0
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>         Attachments: YARN-3388-v0.patch, YARN-3388-v1.patch, YARN-3388-v2.patch
>
>
> When there are multiple active users in a queue, it should be possible for those users to make use of capacity up to max_capacity (or close). The resources should be fairly distributed among the active users in the queue. This works pretty well when there is a single resource being scheduled. However, when there are multiple resources the situation gets more complex and the current algorithm tends to get stuck at Capacity.
> Example illustrated in subsequent comment.

