Date: Mon, 25 May 2015 18:22:17 +0000 (UTC)
From: "zhihai xu (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation

    [ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558428#comment-14558428 ]

zhihai xu commented on YARN-3655:
---------------------------------

Hi [~kasha], thanks for the review.

bq. 1. okToUnreserve
Fixed in the new patch YARN-3655.003.patch.

bq. 2. Add an if (isValidReservation) check in FSAppAttempt#reserve so all the reservation logic stays in one place?
IMHO, it is not a good idea to add an if (isValidReservation) check in FSAppAttempt#reserve, because all the conditions checked in isValidReservation are already checked before we call FSAppAttempt#reserve; the check would be duplicated code and would hurt performance.

bq. 3. In {{FSAppAttempt#assignContainer(node, request, nodeType, reserved)}}...
Fixed in the new patch YARN-3655.003.patch. To remove the separate fitsInMaxShare check, I merged the {{fitsInMaxShare}} check into {{hasContainerForNode}}, which also makes the code cleaner (a rough sketch of the combined check follows below).

bq. 4. While adding this check in FSAppAttempt#assignContainer(node) might work in practice, it somehow feels out of place. Also, assignReservedContainer could also lead to a reservation?
It looks like assignReservedContainer won't lead to a reservation ({{FSAppAttempt#reserve}}): {{FSAppAttempt#reserve}} is only called when the node's available resource is smaller than the requested/reserved resource, and assignReservedContainer only calls assignContainer when the node's available resource is no less than the reserved resource. So only {{FSAppAttempt#assignContainer(node)}} can lead to a reservation, namely when the node's available resource is smaller than the requested resource.

bq. 5. Instead of calling okToUnreserve/!isValidReservation in FairScheduler#attemptScheduling...
Fixed in the new patch YARN-3655.003.patch.
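For reference, here is a rough, self-contained sketch of what the merged validity check in points 2, 3 and 5 is getting at: a reservation stays valid only while the node could ever hold the reserved container and granting it would keep the queue within its max share. All type and method names below are illustrative placeholders, not the code in YARN-3655.003.patch.

{code}
// Illustrative stand-in types; the real check lives in FSAppAttempt/FSLeafQueue.
class SimpleResource {
  final int memoryMb;
  final int vcores;

  SimpleResource(int memoryMb, int vcores) {
    this.memoryMb = memoryMb;
    this.vcores = vcores;
  }

  boolean fitsIn(SimpleResource other) {
    return memoryMb <= other.memoryMb && vcores <= other.vcores;
  }

  SimpleResource plus(SimpleResource other) {
    return new SimpleResource(memoryMb + other.memoryMb, vcores + other.vcores);
  }
}

class ReservationCheckSketch {
  // "hasContainerForNode" with the max-share condition folded in: the node could
  // ever hold the reserved container, and granting it would not push the queue
  // past its configured max share.
  static boolean isValidReservation(SimpleResource reserved,
                                    SimpleResource nodeTotal,
                                    SimpleResource queueUsage,
                                    SimpleResource queueMaxShare) {
    boolean hasContainerForNode = reserved.fitsIn(nodeTotal);
    boolean fitsInMaxShare = queueUsage.plus(reserved).fitsIn(queueMaxShare);
    return hasContainerForNode && fitsInMaxShare;
  }

  public static void main(String[] args) {
    SimpleResource reserved = new SimpleResource(2048, 1);
    SimpleResource nodeTotal = new SimpleResource(8192, 8);
    SimpleResource queueUsage = new SimpleResource(7168, 7);
    SimpleResource queueMaxShare = new SimpleResource(8192, 8);
    // The queue is already close to its max share, so the reservation is no longer
    // valid and should be released instead of pinning the node.
    System.out.println(isValidReservation(reserved, nodeTotal, queueUsage, queueMaxShare));
  }
}
{code}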
bq. 6. Looks like assign-multiple is broken with reserved-containers. The while-loop for assign-multiple should look at both reserved and un-reserved containers assigned. Can we file a follow-up JIRA to fix this?
I suppose you mean assign-multiple is broken after assignReservedContainer turns the reservation into an allocation. Yes, I created YARN-3710 to fix this issue.

bq. Oh, and I found it hard to understand the test....
Fixed in the new patch YARN-3655.003.patch; please review it.

> FairScheduler: potential livelock due to maxAMShare limitation and container reservation
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-3655
>                 URL: https://issues.apache.org/jira/browse/YARN-3655
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-3655.000.patch, YARN-3655.001.patch, YARN-3655.002.patch, YARN-3655.003.patch
>
>
> FairScheduler: potential livelock due to maxAMShare limitation and container reservation.
> If a node is reserved by an application, no other application has a chance to assign a new container on this node until the application that reserved it either assigns a new container there or releases the reserved container.
> The problem is that if an application calls assignReservedContainer and fails to get a new container due to the maxAMShare limitation, it blocks all other applications from using the nodes it has reserved. If all other running applications also cannot launch their AM containers because they are blocked by these reserved containers, a livelock can occur.
> The following code in FSAppAttempt#assignContainer can cause this potential livelock:
> {code}
>     // Check the AM resource usage for the leaf queue
>     if (!isAmRunning() && !getUnmanagedAM()) {
>       List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests();
>       if (ask.isEmpty() || !getQueue().canRunAppAM(
>           ask.get(0).getCapability())) {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Skipping allocation because maxAMShare limit would " +
>               "be exceeded");
>         }
>         return Resources.none();
>       }
>     }
> {code}
> To fix this issue, we can unreserve the node if we cannot allocate the AM container on it due to the maxAMShare limitation and the node is reserved by the application.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
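A minimal, self-contained sketch of the unreserve-on-maxAMShare idea described above, assuming simplified placeholder types (NodeSketch, AmShareFixSketch and queueHasAmHeadroom are illustrative names, not the actual FSAppAttempt/FSSchedulerNode API): when the AM container cannot start because of the maxAMShare limit, any reservation the application holds on the node is released so other applications can schedule there.

{code}
// Placeholder types standing in for scheduler state; the real change belongs in
// FSAppAttempt#assignContainer and would use the actual scheduler API.
class NodeSketch {
  String reservedByApp;              // id of the application holding the reservation, if any

  void unreserve() {
    reservedByApp = null;            // makes the node schedulable for other applications
  }
}

class AmShareFixSketch {
  final String appId;
  final boolean amRunning;           // stands in for isAmRunning()
  final boolean queueHasAmHeadroom;  // stands in for getQueue().canRunAppAM(amResource)

  AmShareFixSketch(String appId, boolean amRunning, boolean queueHasAmHeadroom) {
    this.appId = appId;
    this.amRunning = amRunning;
    this.queueHasAmHeadroom = queueHasAmHeadroom;
  }

  // Mirrors the flow quoted in the description: when the AM cannot start because of
  // maxAMShare, also release any reservation this application holds on the node so
  // other applications can schedule there, which breaks the livelock.
  boolean tryAssign(NodeSketch node) {
    if (!amRunning && !queueHasAmHeadroom) {
      if (appId.equals(node.reservedByApp)) {
        node.unreserve();            // the key addition on top of the quoted code
      }
      return false;                  // corresponds to returning Resources.none()
    }
    return true;                     // the normal assignment path would continue here
  }

  public static void main(String[] args) {
    NodeSketch node = new NodeSketch();
    node.reservedByApp = "app_0001";
    AmShareFixSketch app = new AmShareFixSketch("app_0001", false, false);
    System.out.println(app.tryAssign(node));  // false: AM blocked by maxAMShare
    System.out.println(node.reservedByApp);   // null: reservation released
  }
}
{code}

In the real scheduler the same guard would sit in the maxAMShare branch quoted above, unreserving before returning Resources.none().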