hadoop-yarn-issues mailing list archives

From "Tao Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app
Date Tue, 30 Oct 2018 17:45:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669111#comment-16669111 ]

Tao Yang commented on YARN-8958:
--------------------------------

Attached v2 patch to fix UT failures:
(1) Set {{yarn.resourcemanager.store.class=org.apache.hadoop.yarn.server.resourcemanager.recovery.MemoryRMStateStore}}
for TestFairOrderingPolicy#testSchedulableEntitiesLeak so that the RM does not recover apps
from state left behind by a former test case.
(2) TestCapacityScheduler#testAllocateReorder has always had a latent problem: it activates
only one app but expects both. It passed before because app2 was added into the schedulable
entities through the explicit CapacityScheduler#allocate call in this test case (app2 is put
into entitiesToReorder and then added into schedulableEntities) even though app2 was never
activated. This patch merely exposes the problem. With {{yarn.scheduler.capacity.maximum-am-resource-percent=1.0}}
both apps can be activated, so the test case passes again.
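
A minimal sketch of these two configuration overrides in a test setup (the class and helper
names here are hypothetical illustrations, not code from the patch):
{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class TestConfSketch {
  // Hypothetical helper: builds a configuration with the two overrides
  // described above; not part of the actual patch.
  static YarnConfiguration buildTestConf() {
    YarnConfiguration conf = new YarnConfiguration();
    // (1) In-memory state store: nothing persists across test cases, so the
    //     RM cannot recover apps left behind by a former test case.
    conf.set(YarnConfiguration.RM_STORE,
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.MemoryRMStateStore");
    // (2) Allow AMs to use up to 100% of queue resources so that both test
    //     apps can be activated.
    conf.setFloat("yarn.scheduler.capacity.maximum-am-resource-percent", 1.0f);
    return conf;
  }
}
{code}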

> Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8958
>                 URL: https://issues.apache.org/jira/browse/YARN-8958
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.1
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-8958.001.patch, YARN-8958.002.patch
>
>
> We found an NPE in ClientRMService#getApplications when querying apps with a specified queue.
> The cause is that there is one app which can't be found by calling RMContextImpl#getRMApps
> (it is finished and has been swapped out of memory) but can still be queried from the fair
> ordering policy.
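> A simplified standalone sketch of this failure mode (plain Java with hypothetical names, not
> the actual ClientRMService code): the ordering policy still reports the leaked app, but the
> RM's application map no longer contains it, so the lookup returns null and the dereference throws.
> {code:java}
> import java.util.*;
>
> public class LeakedAppNpeDemo {
>   public static void main(String[] args) {
>     // RMContextImpl#getRMApps analogue: app1 finished and was swapped out.
>     Map<String, String> rmApps = new HashMap<>();
>     // Fair ordering policy analogue: app1 leaked and is still present.
>     Set<String> schedulableEntities = new TreeSet<>(Arrays.asList("app1"));
>
>     for (String appId : schedulableEntities) {
>       String app = rmApps.get(appId);   // returns null for the leaked app1
>       System.out.println(app.length()); // NullPointerException here
>     }
>   }
> }
> {code}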
> To reproduce the schedulable entities leak in the fair ordering policy:
> (1) create app1 and launch container1 on node1
> (2) restart RM
> (3) remove app1 attempt, app1 is removed from the schedulable entities.
> (4) recover container1; the state of container1 is changed to COMPLETED, app1 is brought
> back into entitiesToReorder after the container is released, and then app1 is added back
> into the schedulable entities when the scheduler calls FairOrderingPolicy#getAssignmentIterator.
> (5) remove app1
> To solve this problem, we should make sure that schedulableEntities can only be changed by
> adding or removing an app attempt; the reordering process should never add a new entity
> into schedulableEntities.
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
>     //remove, update comparable data, and reinsert to update position in order
>     schedulableEntities.remove(schedulableEntity);
>     updateSchedulingResourceUsage(
>       schedulableEntity.getSchedulingResourceUsage());
>     schedulableEntities.add(schedulableEntity);
>   }
> {code}
> The code above can be improved as follows to make sure that only an existing entity can be
> re-added into schedulableEntities. (A standalone illustration of the guard follows the
> patched code.)
> {code:java}
>   protected void reorderSchedulableEntity(S schedulableEntity) {
>     //remove, update comparable data, and reinsert to update position in order
>     boolean exists = schedulableEntities.remove(schedulableEntity);
>     updateSchedulingResourceUsage(
>       schedulableEntity.getSchedulingResourceUsage());
>     if (exists) {
>       schedulableEntities.add(schedulableEntity);
>     } else {
>       LOG.info("Skip reordering non-existent schedulable entity: "
>           + schedulableEntity.getId());
>     }
>   }
> {code}
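> For illustration, a standalone sketch (plain Java collections, not YARN code) of why the
> exists guard matters: an unguarded remove-then-add "reorder" silently re-inserts an entity
> that was already removed, which is exactly the leak described in step (4) above.
> {code:java}
> import java.util.TreeSet;
>
> public class ReorderGuardDemo {
>   public static void main(String[] args) {
>     TreeSet<String> entities = new TreeSet<>();
>     entities.add("app1");
>     entities.remove("app1");       // step (3): app attempt removed
>
>     // Unguarded reorder, as in the original code: app1 leaks back in.
>     entities.remove("app1");
>     entities.add("app1");
>     System.out.println(entities);  // prints [app1] -- leaked entity
>
>     entities.clear();
>
>     // Guarded reorder, as in the patch: re-add only if it was present.
>     boolean exists = entities.remove("app1");
>     if (exists) {
>       entities.add("app1");
>     }
>     System.out.println(entities);  // prints [] -- no leak
>   }
> }
> {code}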


