hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation
Date Thu, 04 Jun 2015 05:20:38 GMT

    [ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572177#comment-14572177

zhihai xu commented on YARN-3655:

Hi [~kasha], thanks for the review.
bq. Is it possible to avoid the checks before the call, and do all the checks in the call.
The reasoning behind this is to have all reservation-related code in as few places as possible.
If this is not possible, we can leave it as the patch has it now.
IMHO, it is not possible because {{FSAppAttempt#reserve}} will only be called from {{assignContainer(node)}},
if we move all the condition checks into {{FSAppAttempt#reserve}}, it may return early, which
will cause failing to reserve or allocate container for other priorities, also since {{assignReservedContainer}}
won't call {{FSAppAttempt#reserve}}, we still need keep {{isValidReservation}} check in {{assignReservedContainer}}.

bq. Instead of adding the check to assignContainer(node) can we add it to assignContainer(node,
request, nodeType, reserved)?
IMHO, It will have problem, if we add it to {{assignContainer(node, request, nodeType, reserved)}},
then {{getAllowedLocalityLevelByTime}}/{{getAllowedLocalityLevel}} will be called before the
check instead of after the check, which will change the Scheduling behavior, also it will
affect the performance(late check will increase CPU usage).

> FairScheduler: potential livelock due to maxAMShare limitation and container reservation

> -----------------------------------------------------------------------------------------
>                 Key: YARN-3655
>                 URL: https://issues.apache.org/jira/browse/YARN-3655
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments: YARN-3655.000.patch, YARN-3655.001.patch, YARN-3655.002.patch, YARN-3655.003.patch
> FairScheduler: potential livelock due to maxAMShare limitation and container reservation.
> If a node is reserved by an application, all the other applications don't have any chance
to assign a new container on this node, unless the application which reserves the node assigns
a new container on this node or releases the reserved container on this node.
> The problem is if an application tries to call assignReservedContainer and fail to get
a new container due to maxAMShare limitation, it will block all other applications to use
the nodes it reserves. If all other running applications can't release their AM containers
due to being blocked by these reserved containers. A livelock situation can happen.
> The following is the code at FSAppAttempt#assignContainer which can cause this potential
> {code}
>     // Check the AM resource usage for the leaf queue
>     if (!isAmRunning() && !getUnmanagedAM()) {
>       List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests();
>       if (ask.isEmpty() || !getQueue().canRunAppAM(
>           ask.get(0).getCapability())) {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Skipping allocation because maxAMShare limit would " +
>               "be exceeded");
>         }
>         return Resources.none();
>       }
>     }
> {code}
> To fix this issue, we can unreserve the node if we can't allocate the AM container on
the node due to Max AM share limitation and the node is reserved by the application.

This message was sent by Atlassian JIRA

View raw message