hadoop-yarn-issues mailing list archives

From "Thomas Graves (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3434) Interaction between reservations and userlimit can result in significant ULF violation
Date Thu, 09 Apr 2015 19:07:12 GMT

    [ https://issues.apache.org/jira/browse/YARN-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14488011#comment-14488011 ]

Thomas Graves commented on YARN-3434:

The code you mention is in the else part of that check, where it would make a reservation. The
situation I'm talking about actually allocates a container rather than reserving one. I'll try to
explain better:

The application asks for lots of containers. It acquires some containers, then it reserves some.
At this point it hits its normal user limit, which in my example equals capacity. It hasn't hit
the maximum amount it can allocate or reserve (shouldAllocOrReserveNewContainer()). The next
node that heartbeats in isn't yet reserved and has enough space to place a container
on. The scheduler first checks assignContainers -> canAssignToThisQueue. That passes since we
haven't hit max capacity. Then it checks assignContainers -> canAssignToUser. That passes,
but only because used - reserved < the user limit. This allows it to continue down into
assignContainer. In assignContainer the node has available space and we haven't hit the
shouldAllocOrReserveNewContainer() limit. reservationsContinueLooking is on and labels are
empty, so it does the check:

if (!shouldAllocOrReserveNewContainer
            || Resources.greaterThan(resourceCalculator, clusterResource,
                minimumUnreservedResource, Resources.none()))

As I said before, it's allowed to allocate or reserve, so it passes that test. It also hasn't
met its maximum capacity yet (capacity = 30% and max capacity = 100%), so
minimumUnreservedResource is none and that check doesn't kick in, which means it doesn't go
into the block that calls findNodeToUnreserve(). Then it goes ahead and allocates when it
should have needed to unreserve. Basically, we needed to also do the user limit check again
and force it to do the findNodeToUnreserve.
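
To make the gap concrete, here is a minimal sketch of the arithmetic, not the actual LeafQueue code. The class name, method names, and the GB figures are all hypothetical, and it assumes that "used" already includes reserved resources, which is why subtracting reserved lets the check pass.

```java
// Hypothetical, simplified model of the checks described above.
// Resources are plain GB values; the real scheduler uses Resource objects.
class UserLimitSketch {

    // Mirrors the relaxed check: with reservationsContinueLooking on,
    // reserved resources are subtracted before comparing to the user limit.
    // Assumption: usedGb = allocated + reserved.
    static boolean canAssignToUser(long usedGb, long reservedGb, long userLimitGb) {
        return usedGb - reservedGb < userLimitGb;
    }

    // How much would have to be unreserved so that allocating requestGb
    // keeps the user at or below the limit. Zero means no unreserve is needed.
    static long amountToUnreserve(long usedGb, long requestGb, long userLimitGb) {
        return Math.max(0L, usedGb + requestGb - userLimitGb);
    }

    public static void main(String[] args) {
        long userLimitGb = 30;  // user limit (equal to capacity in the example)
        long allocatedGb = 20;  // containers already running
        long reservedGb = 16;   // two 8 GB reservations on other nodes
        long usedGb = allocatedGb + reservedGb;
        long requestGb = 8;     // new container on a node with free space

        // The relaxed check passes even though used is already over the limit.
        System.out.println("canAssignToUser: "
            + canAssignToUser(usedGb, reservedGb, userLimitGb));

        // Without a second user limit check, the allocation proceeds and
        // nothing is unreserved, although this much should have been.
        System.out.println("should unreserve (GB): "
            + amountToUnreserve(usedGb, requestGb, userLimitGb));
    }
}
```

In this sketch the second method stands in for the missing step: when it returns a non-zero value, the allocation path would need to go through findNodeToUnreserve() before placing the container.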

> Interaction between reservations and userlimit can result in significant ULF violation
> --------------------------------------------------------------------------------------
>                 Key: YARN-3434
>                 URL: https://issues.apache.org/jira/browse/YARN-3434
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.6.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: YARN-3434.patch
> ULF was set to 1.0
> User was able to consume 1.4X queue capacity.
> It looks like when this application launched, it reserved about 1000 containers,
8G each, within about 5 seconds. I think this allowed the logic in assignToUser() to allow
the userlimit to be surpassed.

This message was sent by Atlassian JIRA
