Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Thu, 9 Apr 2015 19:07:12 +0000 (UTC)
From: "Thomas Graves (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12787696.1427984379000.44341.1428606432418@Atlassian.JIRA>
In-Reply-To: <JIRA.12787696.1427984379000@Atlassian.JIRA>
References: <JIRA.12787696.1427984379000@Atlassian.JIRA>
 <JIRA.12787696.1427984379849@arcas>
Subject: [jira] [Commented] (YARN-3434) Interaction between reservations and
 userlimit can result in significant ULF violation
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488011#comment-14488011 ] 

Thomas Graves commented on YARN-3434:
-------------------------------------

The code you mention is in the else part of that check, where it would do a reservation.  The situation I'm talking about actually allocates a container rather than reserving one.  I'll try to explain better:

The application asks for lots of containers. It acquires some containers, then it reserves some. At this point it hits its normal user limit, which in my example equals the queue capacity.  It hasn't hit the max amount it can allocate or reserve (shouldAllocOrReserveNewContainer()).  The next node that heartbeats in isn't yet reserved and has enough space to place a container on.  It first checks assignContainers -> canAssignToThisQueue.  That passes since we haven't hit max capacity. Then it checks assignContainers -> canAssignToUser. That passes, but only because used - reserved < the user limit.  This allows it to continue down into assignContainer.  In assignContainer the node has available space and we haven't hit the shouldAllocOrReserveNewContainer() limit. reservationsContinueLooking is on and labels are empty, so it does the check:

{noformat}
if (!shouldAllocOrReserveNewContainer
            || Resources.greaterThan(resourceCalculator, clusterResource,
                minimumUnreservedResource, Resources.none()))
{noformat}

As I said before, it's allowed to allocate or reserve, so the first half of that condition doesn't trigger.  It also hasn't reached its maximum capacity yet (capacity = 30% and max capacity = 100%), so minimumUnreservedResource is none and the second half doesn't kick in either. As a result it doesn't go into the block that calls findNodeToUnreserve().  It then goes ahead and allocates when it should have needed to unreserve first.  Basically we need to also do the user limit check again here and force it to do the findNodeToUnreserve().
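
To make the arithmetic concrete, here is a minimal sketch of the scenario described above. It is a simplified model, not the actual LeafQueue code: the class UserLimitSketch, the UserUsage type, the amountNeededToUnreserve helper and all of the numbers are made up for illustration, and it assumes that "used" counts reserved space as well as allocated space.

```java
// Simplified, hypothetical model of the scenario; not the actual LeafQueue code.
class UserLimitSketch {

    /** One user's usage in GB. In this model 'used' includes reserved space. */
    static final class UserUsage {
        final int used;
        final int reserved;

        UserUsage(int used, int reserved) {
            this.used = used;
            this.reserved = reserved;
        }
    }

    /** The check as described above: passes while used - reserved < user limit. */
    static boolean canAssignToUser(UserUsage u, int userLimit) {
        return u.used - u.reserved < userLimit;
    }

    /** Hypothetical re-check: GB to unreserve so that used + request <= user limit. */
    static int amountNeededToUnreserve(UserUsage u, int request, int userLimit) {
        return Math.max(0, u.used + request - userLimit);
    }

    public static void main(String[] args) {
        int userLimit = 32;                   // GB, equal to queue capacity in this example
        UserUsage u = new UserUsage(32, 16);  // 16 GB allocated + 16 GB reserved
        int request = 8;                      // one 8 GB container

        System.out.println("canAssignToUser: " + canAssignToUser(u, userLimit));
        System.out.println("used if allocated without unreserving: " + (u.used + request));
        System.out.println("needs to unreserve first: "
            + amountNeededToUnreserve(u, request, userLimit));
    }
}
```

With these numbers the user check passes because 32 - 16 = 16 is under the 32 GB limit, so allocating the 8 GB container directly would leave the user at 40 GB. A second user limit check at allocation time would instead report that 8 GB has to be unreserved first, which is the point where findNodeToUnreserve() should be forced.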


> Interaction between reservations and userlimit can result in significant ULF violation
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-3434
>                 URL: https://issues.apache.org/jira/browse/YARN-3434
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.6.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: YARN-3434.patch
>
>
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 containers, each 8G, within about 5 seconds. I think this allowed the logic in assignToUser() to allow the userlimit to be surpassed.

