hadoop-yarn-issues mailing list archives

From "Carlo Curino (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4198) CapacityScheduler locking / synchronization improvements
Date Wed, 16 Dec 2015 18:23:47 GMT

    [ https://issues.apache.org/jira/browse/YARN-4198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15060459#comment-15060459 ]

Carlo Curino commented on YARN-4198:

[~xinxianyin] the way we got to this was by running a "busy" workload with lots of reservation-related
pressure on the CS, staring at a profiler, and progressively working out which locks could be weakened
and which data structures could be changed to improve the performance of the scheduler.
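
To illustrate what "weakening" a lock can mean here, the following is a minimal sketch, not code from the patch. It assumes a hypothetical `QueueUsage` class whose reads vastly outnumber its writes, and replaces a single exclusive monitor with a `ReentrantReadWriteLock` so that concurrent readers no longer block each other.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical usage tracker for illustration; not a CapacityScheduler class.
class QueueUsage {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, Long> usedMb = new HashMap<>();

    // Writers still take the exclusive lock.
    void allocate(String queue, long mb) {
        lock.writeLock().lock();
        try {
            usedMb.merge(queue, mb, Long::sum);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers share the read lock, unlike with a synchronized method,
    // where every caller would serialize on the same monitor.
    long used(String queue) {
        lock.readLock().lock();
        try {
            return usedMb.getOrDefault(queue, 0L);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Whether this kind of change pays off depends on the read/write ratio seen in the profiler, so it is a pattern to evaluate rather than a guaranteed improvement.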

I think this is looking at the same set of problems you are tracking in YARN-3091 but with
a particular focus on the needs of the reservation system. I expect the changes in this patch
(we will post an initial version soon) to be generally useful, and possibly to partially overlap
with some of the YARN-3091 sub-JIRAs.

The improvements we observed were very substantial (we went from thrashing on locks in a 256-node
cluster at 50-60 concurrent reservations to chugging along nicely on a 2700-node cluster at
over 1000 concurrent reservations). Note that all that testing was done for this patch combined
with the rest of the YARN-4193 work, therefore I suggest that:
 # We will do a round of tests of this patch in isolation to make sure the changes are good
independently of the rest of what we did in YARN-4193.
 # We will post a version of the patch.
 # You can review it and help us figure out: 1) whether it is good/safe/agreeable, 2) how
it relates to some of the other efforts that are ongoing (it might resolve some of the sub-JIRAs
or provide partial work towards them).

[~kshukla], [~wangda], [~jianhe], [~jlowe] if you guys have time to look at this as well,
it would be great. As I mentioned to some of you already, this is a very delicate portion
of the scheduler, and we need lots of eyes (ideally both staring at the patch and testing
independently on a cluster) to convince ourselves that what is proposed is safe/correct and

> CapacityScheduler locking / synchronization improvements
> --------------------------------------------------------
>                 Key: YARN-4198
>                 URL: https://issues.apache.org/jira/browse/YARN-4198
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Carlo Curino
>            Assignee: Alexey Tumanov
> In the context of YARN-4193 (which stresses the RM/CS performance) we found several performance
problems in the locking/synchronization of the CapacityScheduler, as well as inconsistencies
that do not normally surface (incorrect locking order of queues protected by CS locks, etc.).
This JIRA proposes several refactorings that improve this.
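
The locking-order problem mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the patch itself: `SketchQueue` and `QueueOps` are hypothetical names, and queue paths are assumed to be unique. Whenever two queue locks are needed, they are always acquired in the same global order (here, by path), so two threads working on the same pair of queues in opposite directions cannot deadlock.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical queue type for illustration; not the scheduler's own queue class.
class SketchQueue {
    final String path;
    final ReentrantLock lock = new ReentrantLock();
    long usedMb;

    SketchQueue(String path, long usedMb) {
        this.path = path;
        this.usedMb = usedMb;
    }
}

class QueueOps {
    // Moves usage between two queues. Locks are always taken in path order,
    // regardless of which queue is the source, to keep a single global order.
    static void move(SketchQueue from, SketchQueue to, long mb) {
        SketchQueue first = from.path.compareTo(to.path) <= 0 ? from : to;
        SketchQueue second = (first == from) ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.usedMb -= mb;
                to.usedMb += mb;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }
}
```

In a queue hierarchy, a parent-before-child rule could serve the same purpose; the important property is only that every code path agrees on one order.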

