hadoop-yarn-issues mailing list archives

From "Eric Payne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
Date Tue, 17 Jul 2018 22:22:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547161#comment-16547161 ]

Eric Payne commented on YARN-4606:
----------------------------------

Thank you, [~manirajv06@gmail.com], for the latest patch.

The code changes look good. However, I have a few points about the tests.

- I have a general concern that these tests are not testing the fix to the starvation problem
outlined in the description of this JIRA. I'm trying to determine if there is a clean way
to unit test that use case.
- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}}: I am concerned about
new tests that take longer than necessary because the unit tests keep taking longer and longer
to run. I think that the following things can be done to reduce this test time (in my build
environment) from 1min 17sec to 24 sec.
-- In the following code, the sleep(5000) outside of the for loop is not necessary.
-- In the following code, the sleep(5000) inside of the for loop could be cut down to sleep(500).
{code:title=TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}
    Thread.sleep(5000);

    //Triggering this event so that user limit computation can
    //happen again
    for (int i = 0; i < 10; i++) {
      cs.handle(new NodeUpdateSchedulerEvent(rmNode1));
      Thread.sleep(5000);
    }
{code}

- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps1}}: I don't think
this test is necessary. It takes more than 1:20 to run in my build environment, and as far
as I can tell, it is verifying that the active users count is not ever updated after a move
if node heartbeats are not received. However, in a running YARN installation, node heartbeats
are received every second (by default). Unless I'm missing something, this isn't a use case
that one would encounter in a running Hadoop system.
- {{TestCapacityScheduler#setupQueueConfigurationForActiveUsersChecks}}: The parameters to
{{conf.setUserLimitFactor(...)}} don't need to be 100.0f. User limit factor can be thought
of as the multiplier for the amount of a queue that one user can consume. So, if the user
limit factor is 1.0f, one user can use the capacity of the queue. If it is 2.0f, one user
can use twice the capacity of the queue, and so forth. Since these queues have a capacity
of 50%, I would set this to 2.0f.
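
To illustrate the user limit factor arithmetic described in the last point, here is a minimal sketch. It is not the CapacityScheduler implementation: the class and method names are hypothetical, and the cap at 100% of the cluster is an assumption standing in for the queue's configured maximum capacity.

```java
// Simplified illustration only; NOT the CapacityScheduler code.
// Class and method names are hypothetical.
final class UserLimitFactorSketch {

    /**
     * Upper bound on what one user can consume, as a percentage of the
     * cluster: queue capacity multiplied by the user limit factor.
     * The 100% cap is an assumption standing in for the queue's
     * maximum capacity.
     */
    static float maxUserSharePercent(float queueCapacityPercent, float userLimitFactor) {
        return Math.min(100.0f, queueCapacityPercent * userLimitFactor);
    }

    public static void main(String[] args) {
        float queueCapacity = 50.0f; // the test queues have a capacity of 50%

        // Factor 1.0f: one user can use the capacity of the queue.
        System.out.println(maxUserSharePercent(queueCapacity, 1.0f));   // 50.0
        // Factor 2.0f: one user can use twice the capacity of the queue.
        System.out.println(maxUserSharePercent(queueCapacity, 2.0f));   // 100.0
        // Factor 100.0f: no more headroom than 2.0f under this cap.
        System.out.println(maxUserSharePercent(queueCapacity, 100.0f)); // 100.0
    }
}
```

Under these assumptions, 2.0f already lets a single user reach the whole cluster from a 50% queue, so a larger value adds nothing for these tests.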


> CapacityScheduler: applications could get starved because computation of #activeUsers
considers pending apps 
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch, YARN-4606.003.patch, YARN-4606.004.patch,
YARN-4606.005.patch, YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.3.patch,
YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are pending (caused by
max-am-percent, etc.), ActiveUsersManager still considers that user to be an active user. This
could lead to starvation of active applications, for example:
> - App1 (belongs to user1) and app2 (belongs to user2) are active; app3 (belongs to user3) and app4 (belongs
to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources,
so the computed user-limit-resource could be lower than expected.
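
The effect in the example above can be sketched with a simplified equal-split model. This is an assumption made for illustration: the real user limit computation also takes minimum-user-limit-percent and user limit factor into account, and the class name, method name, and resource figures below are hypothetical.

```java
// Simplified illustration only; NOT the CapacityScheduler code.
// Assumes the queue's resource is split evenly across active users.
final class ActiveUsersSketch {

    /** Per-user share when queueResource is divided evenly among activeUsers. */
    static int perUserLimit(int queueResource, int activeUsers) {
        if (activeUsers <= 0) {
            throw new IllegalArgumentException("activeUsers must be positive");
        }
        return queueResource / activeUsers;
    }

    public static void main(String[] args) {
        int queueResource = 100; // hypothetical units, e.g. GB of memory

        // Counting user3 and user4, whose apps are only pending:
        System.out.println(perUserLimit(queueResource, 4)); // 25
        // Counting only user1 and user2, who can actually allocate:
        System.out.println(perUserLimit(queueResource, 2)); // 50
    }
}
```

In this model, user1 and user2 are each held to 25 units although nobody else can use the remaining 50, which is the starvation pattern the description refers to.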



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

