hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5074) RM cycles through container ids for an app that is waiting for resources.
Date Wed, 11 May 2016 23:54:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-5074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15281017#comment-15281017 ]

Wangda Tan commented on YARN-5074:

Thanks [~sidharta-s] for reporting this issue.

This happens when:

- A cluster with multiple nodes (#nodes >= 2)
- App1 takes almost all of the cluster's resources
- App2's AM request can be reserved but cannot get allocated
- App2 gets a resource from a node other than the reserved node (in other words, reservation cancellation happens). App2 can then get a container id with number > 1.

From what I can see, there are two issues where container ids can be skipped when working with reservation-continuous-looking:

*Issue #1: multiple containerIds will be skipped*
In LeafQueue#assignContainer:
    // Create the container if necessary
    Container container = 
        getContainer(rmContainer, application, node, capability, priority);

This call happens before a container is successfully allocated or reserved.

So if LeafQueue's relaxed checks consider reserved resources, it is possible that unnecessary getContainer calls happen.

This issue only exists in branch-2.7. Branch-2.8/branch-2/trunk will not create a containerId unless it allocates or reserves a new container.
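To make the ordering difference concrete, here is a minimal sketch contrasting the two behaviors; `AssignmentSketch`, `decide`, and the MB-based check are hypothetical stand-ins for illustration, not the actual LeafQueue code:

```java
// Sketch contrasting "create container id before deciding" (branch-2.7)
// with "create it only after deciding to allocate/reserve" (branch-2.8+).
// All names here are hypothetical, not YARN's real scheduler API.
public class AssignmentSketch {
    enum Decision { ALLOCATE, RESERVE, SKIP }

    static long nextId = 1;

    // Simplified stand-in for the scheduler's allocate/reserve decision.
    static Decision decide(int availableMB, int requestedMB, boolean canReserve) {
        if (availableMB >= requestedMB) return Decision.ALLOCATE;
        return canReserve ? Decision.RESERVE : Decision.SKIP;
    }

    // branch-2.7 style: the container (and its id) is created up front,
    // so a SKIP outcome burns an id.
    static long assignEagerly(int availableMB, int requestedMB, boolean canReserve) {
        long id = nextId++;                  // analogous to getContainer() up front
        Decision d = decide(availableMB, requestedMB, canReserve);
        return d == Decision.SKIP ? -1 : id;
    }

    // branch-2.8+ style: the id is drawn only after deciding to allocate or reserve.
    static long assignLazily(int availableMB, int requestedMB, boolean canReserve) {
        Decision d = decide(availableMB, requestedMB, canReserve);
        if (d == Decision.SKIP) return -1;   // no id consumed
        return nextId++;
    }

    public static void main(String[] args) {
        long skipped = assignEagerly(1024, 2048, false); // SKIP: id 1 is burned
        long got = assignLazily(4096, 2048, false);      // ALLOCATE: gets id 2
        System.out.println("eager skip=" + skipped + ", next allocation=" + got);
    }
}
```

In the eager version every failed assignment attempt consumes an id, which is how many ids can be skipped while an app waits.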

*Issue #2: a single container id will be skipped*
This issue exists in both branch-2.7 and branch-2.8+.

When one container (c1) is reserved at host1 and later the reservation is cancelled to allocate another container (c2) at a different host, the containerId of c1 is skipped.
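A minimal sketch of this path, assuming (as the ids in this report suggest) that reserving draws from the same monotonically increasing counter as allocating; `ReservationSketch` and its methods are hypothetical, not YARN's actual scheduler code:

```java
// Hypothetical sketch: a cancelled reservation still consumes a container id,
// because reserve() and allocate() share one monotonically increasing counter.
public class ReservationSketch {
    private long nextId = 1;

    // Reserving a container draws an id, just like allocating one.
    public long reserve(String host) { return nextId++; }

    public long allocate(String host) { return nextId++; }

    public static void main(String[] args) {
        ReservationSketch s = new ReservationSketch();
        long c1 = s.reserve("host1");   // c1 gets id 1, reserved on host1
        // The reservation for c1 on host1 is cancelled; id 1 is never reused.
        long c2 = s.allocate("host2");  // c2 gets id 2 on a different host
        System.out.println("c1=" + c1 + ", c2=" + c2 + " (id " + c1 + " skipped)");
    }
}
```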

Uploading a demo test to reproduce this issue in branch-2.7:

> RM cycles through container ids for an app that is waiting for resources. 
> --------------------------------------------------------------------------
>                 Key: YARN-5074
>                 URL: https://issues.apache.org/jira/browse/YARN-5074
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.2
>            Reporter: Sidharta Seethana
>         Attachments: YARN-5074-test-case.patch
> /cc [~wangda], [~vinodkv]
> This was observed on a cluster running a 2.7.x build. Here is the scenario:
> 1. A YARN cluster has applications running that almost entirely consume the cluster, with little available resources.
> 2. A new app is submitted - the resources required for the AM exceed what is available in the cluster. The app stays in the 'ACCEPTED' state till resources are available.
> 3. Once resources are available and the AM container comes up, the AM container has an id that indicates that the RM has been cycling through containers. There are no errors in the logs of any kind. One example id for such an AM container is: container_e3788_1462916288781_0012_01_000302. This indicates that while the app was in the 'ACCEPTED' state, the RM cycled through 301 container ids.

This message was sent by Atlassian JIRA

