aurora-reviews mailing list archives

From Maxim Khutornenko <ma...@apache.org>
Subject Re: Review Request 51929: Scheduling multiple tasks per round.
Date Fri, 16 Sep 2016 01:54:30 GMT


> On Sept. 16, 2016, 1:20 a.m., Aurora ReviewBot wrote:
> > Master (783baae) is red with this patch.
> >   ./build-support/jenkins/build.sh
> > 
> >                              # Create file stdout for capturing output. We can't use StringIO mock
> >                              # because TestProcess is running fork.
> >                              with open(os.path.join(td, 'sys_stdout'), 'w+') as stdout:
> >                                with open(os.path.join(td, 'sys_stderr'), 'w+') as stderr:
> >                                  with mutable_sys():
> >                                    sys.stdout, sys.stderr = stdout, stderr
> >                          
> >                                    p = TestProcess('process', 'echo hello world; echo >&2 hello stderr', 0,
> >                                                    taskpath, sandbox, logger_destination=LoggerDestination.BOTH)
> >                                    p.start()
> >                                    rc = wait_for_rc(taskpath.getpath('process_checkpoint'))
> >                          
> >                                    assert rc == 0
> >                                    # Check log files were created in std path with correct content
> >                      >             assert_log_content(taskpath, 'stdout', 'hello world\n')
> >                      
> >                      src/test/python/apache/thermos/core/test_process.py:487: 
> >                      _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
> >                      
> >                      taskpath = <apache.thermos.common.path.TaskPath object at 0x7fdd3cd73b10>
> >                      log_name = 'stdout'
> >                      expected_content = 'hello world\n'
> >                      
> >                          def assert_log_content(taskpath, log_name, expected_content):
> >                            log = taskpath.with_filename(log_name).getpath('process_logdir')
> >                            assert os.path.exists(log)
> >                            with open(log, 'r') as fp:
> >                      >       assert fp.read() == expected_content
> >                      E       assert '' == 'hello world\n'
> >                      E         + hello world
> >                      
> >                      src/test/python/apache/thermos/core/test_process.py:313: AssertionError
> >                       generated xml file: /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml
> >                       1 failed, 710 passed, 6 skipped, 1 warnings in 226.09 seconds
> >                      
> > FAILURE
> > 
> > 
> > 01:19:57 04:18   [complete]
> >                FAILURE
> > 
> > 
> > I will refresh this build result if you post a review containing "@ReviewBot retry"

@ReviewBot retry


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/51929/#review149162
-----------------------------------------------------------


On Sept. 16, 2016, 12:51 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/51929/
> -----------------------------------------------------------
> 
> (Updated Sept. 16, 2016, 12:51 a.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is phase 2 of the scheduling perf improvement effort started in https://reviews.apache.org/r/51759/.
> 
> We can now take a configurable number of task IDs from a given `TaskGroup` per scheduling round. The idea is to go deeper through the offer queue and assign more than one task if possible. This approach delivers substantially better MTTA and still ensures fairness across multiple `TaskGroups`. We have observed almost linear improvement in MTTA (4x+ with 5 tasks per round), which suggests that `max_tasks_per_schedule_attempt` can be set even higher if the majority of cluster jobs have a large number of instances and/or large update batch sizes.
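
To illustrate the idea described above, here is a minimal Python sketch of a single scheduling round that takes several task IDs from one group and walks the offer queue once. It is not the Aurora implementation (which is Java): `schedule_round`, the `fits` predicate, and the one-task-per-offer simplification are assumptions made for illustration. Only the `max_tasks_per_schedule_attempt` name comes from the description.

```python
from collections import deque


def schedule_round(task_ids, offers, fits, max_tasks_per_schedule_attempt=5):
    """Assign up to `max_tasks_per_schedule_attempt` tasks in one pass over the offers.

    All tasks are assumed to belong to the same group and share one
    configuration, so the same `fits(task_id, offer)` check applies to each.
    Hypothetical simplification: each offer hosts at most one task per round.
    """
    batch = deque(task_ids[:max_tasks_per_schedule_attempt])
    assignments = {}
    for offer in offers:
        if not batch:
            break  # the whole batch is placed; stop scanning early
        if fits(batch[0], offer):
            assignments[batch.popleft()] = offer
    return assignments


# Hypothetical usage: offers are modelled as free-CPU counts, tasks need 4 CPUs.
pending = ["t1", "t2", "t3", "t4", "t5", "t6", "t7"]
offer_queue = [1, 4, 2, 8, 16, 32, 3]
placed = schedule_round(pending, offer_queue, lambda task, cpus: cpus >= 4,
                        max_tasks_per_schedule_attempt=3)
print(placed)
```

Tasks left over after the round (here `t4` onward) would simply wait for a later round, which is how a cap on the batch size can keep one large group from starving the others.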
> 
> As far as a single round perf goes, we can consider the following 2 worst-case scenarios:
> - master: single task scheduling fails after trying all offers in the queue
> - this patch: N tasks launched with the very last N offers in the queue + `(N x single_task_launch_latency)`
> 
> Assuming that matching N tasks against M offers takes exactly the same time as matching 1 task against M offers (as they all share the same `TaskGroup`), the only measurable difference comes from the additional `N x single_task_launch_latency` overhead. Based on real cluster observations, the `single_task_launch_latency` is less than 1% of a single task scheduling attempt, which is far less than the savings from the avoided additional scheduling rounds.
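
The worst-case comparison above can be put into numbers with a small Python sketch. The cost model is an assumption for illustration: one full pass over the offer queue costs `scan_cost`, each launch costs `launch_cost`, and without batching every one of the N tasks is assumed to pay for its own full pass in the worst case. The function names are hypothetical; the 1% figure is the upper bound quoted in the description.

```python
def batched_round_cost(scan_cost, launch_cost, n_tasks):
    """Worst case with this patch: one full scan plus N launches."""
    return scan_cost + n_tasks * launch_cost


def one_task_per_round_cost(scan_cost, launch_cost, n_tasks):
    """Assumed worst case without batching: N rounds, each a full scan plus one launch."""
    return n_tasks * (scan_cost + launch_cost)


scan = 1.0            # one scheduling attempt, in arbitrary time units
launch = 0.01 * scan  # launch latency at the quoted upper bound of 1%
n = 5                 # tasks per round

batched = batched_round_cost(scan, launch, n)
unbatched = one_task_per_round_cost(scan, launch, n)
print(f"batched: {batched:.2f}, one task per round: {unbatched:.2f}")
```

Under these assumptions the batched round costs about 1.05 units against about 5.05 units for five separate rounds, so the extra launch overhead is small next to the rounds it avoids.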
> 
> As far as jmh results go, the new approach (batching + multiple tasks per round) is only slightly more demanding (~8%). Both results, though, are MUCH higher than the real cluster perf, which just confirms we are not bound by CPU time here:
> 
> Master:
> ```
> Benchmark                                                                    Mode  Cnt      Score     Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  thrpt   10  17126.183 ± 488.425  ops/s
> ```
> 
> This patch:
> ```
> Benchmark                                                                    Mode  Cnt      Score     Error  Units
> SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark.runBenchmark  thrpt   10  15838.051 ± 187.890  ops/s
> ```
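
For reference, the relative difference between the two jmh scores can be recomputed with a few lines of Python. The variable names are illustrative, and the scores are copied from the tables above with the error margins ignored.

```python
master_ops = 17126.183  # ops/s on master
patch_ops = 15838.051   # ops/s with this patch

# Fraction of throughput lost relative to master.
throughput_drop = (master_ops - patch_ops) / master_ops

# Increase in time per operation (time per op is the inverse of throughput).
time_per_op_increase = master_ops / patch_ops - 1

print(f"throughput drop: {throughput_drop:.1%}")
print(f"time per op increase: {time_per_op_increase:.1%}")
```

The throughput drop comes out at roughly 7.5%, and the time per operation rises by roughly 8.1%, which is in line with the ~8% figure quoted in the description.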
> 
> NOTE: this will not apply cleanly as it branched off of https://reviews.apache.org/r/51765, which itself depends on https://reviews.apache.org/r/51759/.
> 
> 
> Diffs
> -----
> 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java 9d0d40b82653fb923bed16d06546288a1576c21d
>   src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java 87b9e1928ab2d44668df1123f32ffdc4197c0c70
>   src/main/java/org/apache/aurora/scheduler/scheduling/SchedulingModule.java 11e8033438ad0808e446e41bb26b3fa4c04136c7
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroup.java 5d319557057e27fd5fc6d3e553e9ca9139399c50
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java c044ebe6f72183a67462bbd8e5be983eb592c3e9
>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskScheduler.java d266f6a25ae2360db2977c43768a19b1f1efe8ff
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 7f7b4358ef05c0f0d0e14daac1a5c25488467dc9
>   src/test/java/org/apache/aurora/scheduler/events/NotifyingSchedulingFilterTest.java ece476b918e6f2c128039e561eea23a94d8ed396
>   src/test/java/org/apache/aurora/scheduler/filter/AttributeAggregateTest.java 209f9298a1d55207b9b41159f2ab366f92c1eb70
>   src/test/java/org/apache/aurora/scheduler/filter/SchedulingFilterImplTest.java 0cf23df9f373c0d9b27e55a12adefd5f5fd81ba5
>   src/test/java/org/apache/aurora/scheduler/http/AbstractJettyTest.java c2ceb4e7685a9301f8014a9183e02fbad65bca26
>   src/test/java/org/apache/aurora/scheduler/preemptor/PreemptionVictimFilterTest.java ee5c6528af89cc62a35fdb314358c489556d8131
>   src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorImplTest.java 98048fabc00f233925b6cca015c2525980556e2b
>   src/test/java/org/apache/aurora/scheduler/preemptor/PreemptorModuleTest.java 2c3e5f32c774be07a5fa28c8bcf3b9a5d88059a1
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskGroupsTest.java 95cf25eda0a5bfc0cc4c46d1439ebe9d5359ce79
>   src/test/java/org/apache/aurora/scheduler/scheduling/TaskSchedulerImplTest.java 72562e6bd9a9860c834e6a9faa094c28600a8fed
>   src/test/java/org/apache/aurora/scheduler/state/TaskAssignerImplTest.java b4d27f69ad5d4cce03da9f04424dc35d30e8af29
> 
> Diff: https://reviews.apache.org/r/51929/diff/
> 
> 
> Testing
> -------
> 
> All types of testing including deploying to test and production clusters.
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>

