hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
Date Tue, 31 Mar 2015 09:11:53 GMT

    [ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388285#comment-14388285 ]

zhihai xu commented on YARN-3415:

[~sandyr], that is a very good idea: move the call to setAMResource, which is currently in
FairScheduler, next to the call to getQueue().addAMResourceUsage().
The new patch YARN-3415.001.patch addresses this issue, and it also addresses your first two
comments.

[~ragarwal], thanks for the review.
First, I want to clarify that the AM resource usage is not changed when the AM container
completes. It only changes when the application attempt is removed from the scheduler, which
calls FSLeafQueue#removeApp.
So currently the "Check that AM resource usage becomes 0" step is done after all application
attempts are removed:
    assertEquals("Queue1's AM resource usage should be 0",
        0, queue1.getAmResourceUsage().getMemory());
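The lifecycle described above can be sketched like this. This is a simplified stand-in, not Hadoop's actual FSLeafQueue: the class and method names mirror the ones discussed in this thread, but the implementation is illustrative only.

```java
// Minimal sketch: AM resource usage is added when the AM container is
// allocated and released only when the attempt is removed from the queue,
// not when the AM container itself completes.
class Resource {
    private int memory;
    Resource(int memory) { this.memory = memory; }
    int getMemory() { return memory; }
    void add(int mem) { memory += mem; }
    void subtract(int mem) { memory -= mem; }
}

class AppAttempt {
    final Resource amResource;   // set when the AM container is allocated
    AppAttempt(int amMemory) { amResource = new Resource(amMemory); }
}

class LeafQueue {
    private final Resource amResourceUsage = new Resource(0);

    // Called when an application's AM container is allocated.
    void addAMResourceUsage(Resource r) { amResourceUsage.add(r.getMemory()); }

    // AM usage is released only here, when the attempt is removed from the
    // scheduler -- not on AM container completion.
    void removeApp(AppAttempt app) { amResourceUsage.subtract(app.amResource.getMemory()); }

    Resource getAmResourceUsage() { return amResourceUsage; }
}

public class AmUsageSketch {
    public static void main(String[] args) {
        LeafQueue queue1 = new LeafQueue();
        AppAttempt att = new AppAttempt(1024);
        queue1.addAMResourceUsage(att.amResource);
        System.out.println(queue1.getAmResourceUsage().getMemory()); // 1024
        queue1.removeApp(att);   // attempt removed from scheduler
        System.out.println(queue1.getAmResourceUsage().getMemory()); // 0
    }
}
```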

bq. Add a non-AM container to app5. Handle the nodeUpdate event - check that the number of
live containers is 2.
The old code already had this test for app1; it passes without the patch:
    // Still can run non-AM container
    createSchedulingRequestExistingApplication(1024, 1, attId1);
    assertEquals("Application1 should have two running containers",
        2, app1.getLiveContainers().size());

I think the issue you saw is that the non-AM container allocation is delayed until after the
AM container has finished, which leaves getLiveContainers() empty.
My test simulates "complete the AM container before the non-AM container is allocated": without
the patch, the old code increases the AM resource usage when the non-AM container is allocated,
so the test fails.
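The scenario above can be reduced to a small sketch of the fragile condition and the fix. The names and logic here are illustrative, not the exact Hadoop source: the buggy version decides "this is the AM container" from the live-container count, while the fixed version charges AM usage at most once per attempt.

```java
// Hypothetical sketch of the fragile condition and its fix.
class TrackedApp {
    int liveContainers = 0;
    boolean amResourceCounted = false;   // fix: remember the AM was already charged
}

public class AmConditionSketch {
    static int amResourceUsage = 0;

    // Old condition (simplified): "no live containers" was taken to mean
    // "this allocation must be the AM container".
    static void allocateOld(TrackedApp app, int mem) {
        if (app.liveContainers == 0) {
            amResourceUsage += mem;      // wrongly charged if the AM already exited
        }
        app.liveContainers++;
    }

    // Fixed condition (simplified): charge AM usage at most once per attempt,
    // independent of the live-container count.
    static void allocateFixed(TrackedApp app, int mem) {
        if (!app.amResourceCounted) {
            app.amResourceCounted = true;
            amResourceUsage += mem;
        }
        app.liveContainers++;
    }

    public static void main(String[] args) {
        // Scenario from the comment: the AM container completes before the
        // non-AM container is allocated.
        TrackedApp app = new TrackedApp();
        allocateOld(app, 1024);    // AM container charged: usage = 1024
        app.liveContainers--;      // AM container completes
        allocateOld(app, 2048);    // non-AM container is wrongly charged too
        System.out.println(amResourceUsage);   // 3072 instead of 1024
    }
}
```

With allocateFixed, the second allocation in the same scenario leaves amResourceUsage at 1024, which is what the patched test asserts.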

> Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
> ---------------------------------------------------------------------------------
>                 Key: YARN-3415
>                 URL: https://issues.apache.org/jira/browse/YARN-3415
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: Rohit Agarwal
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3415.000.patch, YARN-3415.001.patch
> We encountered this problem while running a Spark cluster. The amResourceUsage for a
queue became artificially high, and then the cluster got deadlocked because the maxAMShare
constraint kicked in and no new AM got admitted to the cluster.
> I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289
> In summary - the condition for adding the container's memory towards amResourceUsage
is fragile. It depends on the number of live containers belonging to the app. We saw that
the Spark AM went down without explicitly releasing its requested containers, and then one
of those containers' memory was counted towards amResourceUsage.
> cc - [~sandyr]

This message was sent by Atlassian JIRA
