aurora-reviews mailing list archives

From David McLaughlin <da...@dmclaughlin.com>
Subject Re: Review Request 59480: Expose bin-packing options via OfferManager ordering.
Date Wed, 31 May 2017 21:44:40 GMT


> On May 26, 2017, 8:23 a.m., Stephan Erb wrote:
> > I have thought about this in the past as well, and I am slightly skeptical that a simple sort by resources is a good approach when using oversubscription, especially given the goal of running almost all (non-production) tasks as revocable.
> > 
> > Offers do contain both revocable and non-revocable resources, so we would need to
find an ordering that is suitable for both:
> > 
> > * Sort by non-revocable resources first: we will pack the production tasks tightly, but those are now pretty simple to schedule anyway, as most stuff is running as revocable. The corresponding downside is that we don't reduce fragmentation for revocable resources.
> > 
> > * Sort by revocable resources: offers with the fewest revocable resources will be at the front. Given the nature of the oversubscription, these will be the nodes with the highest load (as those are capped to 0 revocable resources). If we schedule further non-revocable tasks on them, we will either get bad performance or trigger the termination of revocable tasks (depending on our `Estimator` and `QoSController`).
> > 
> > 
> > Alternative first-fit solutions that come to mind (partially inspired by http://shonan.nii.ac.jp/shonan/seminar011/files/2012/02/stein.pdf):
> > 
> > * Sort by agent id. Over time, this should converge to the same cluster state as
sorting offers by size. The upside is 
> >   that it would work for revocable and non-revocable resources simultaneously.
> > 
> > * Compute a score per offer and then schedule on the node with the smallest offer. As an approximation we could use something like `10 ** (free_ram/total_ram) + 10 ** (free_revocable_ram/total_revocable_ram) ...`.
> > 
> > * Extend the sort by revocable resources with a special case for 0 remaining revocable resources (see the sketch after this list).
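> > 
> > A rough sketch of that last idea, assuming a hypothetical Offer holder rather than Aurora's actual offer types:
> > 
> >     import java.util.Comparator;
> > 
> >     class RevocableOrderSketch {
> >       static final class Offer {
> >         final double revocableCpus;
> >         Offer(double revocableCpus) { this.revocableCpus = revocableCpus; }
> >       }
> > 
> >       // Offers with zero remaining revocable CPU are the hottest agents, so
> >       // they sort last; everything else sorts ascending so the smallest
> >       // revocable remainders are consumed first.
> >       static final Comparator<Offer> ORDER = Comparator
> >           .comparing((Offer o) -> o.revocableCpus == 0.0)
> >           .thenComparingDouble(o -> o.revocableCpus);
> >     }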
> 
> David McLaughlin wrote:
>     I guess my main question here is: why wouldn't your alternative orderings not just
be submitted as patches?
>     
>     Sorting offers is definitely not a panacea and I didn't intend to suggest that. Please
see my reply to Santhosh too. I will clean up the README/docs if we decide to ship this. 

>     
>     To clarify further: 
>     
>     What this patch is trying to enable is the ability to influence the global ordering in a way that is better than random. It's basically a chance to say "this is what I would prefer to happen for _all_ tasks in the cluster, absent any other information". E.g. at Twitter, CPU is by far our scarcest resource, and any bin-packing solution would be based almost solely on that.

>     
>     But what you really want is a score-based approach that makes smart decisions _per-task_. This is sort of what constraints are, except that for bin-packing we don't want to avoid scheduling things when ideal circumstances cannot be found, but instead make sure we pick the best of a bad bunch. The problem is performance.
>     
>     Our steady state at Twitter is about 300 calls to TaskAssigner.maybeAssign per minute; these are your crons, failing tasks, and other recurring scheduling load. When engineers start deploying, we see that number increase to around 1-2k per minute.
>     
>     The current algorithm has a worst-case complexity of O(o·n), where o = number of offers and n = number of assign attempts (let's assume the filtering done per offer is a constant cost). The worst case is hit when none of the offers match a task, which isn't all that uncommon. With 2k attempts and 40k offers (on the order of 80 million offer evaluations per minute), this is already getting close to being a scheduling bottleneck for us, e.g. http://i.imgur.com/C9O8BN8.png.
>     
>     
>     A score-based approach that ranks all offers based on what is ideal for the current task is much more complex. You need to build a sorted structure of all offers for each maybeAssign attempt, so you now have something like an O(o·log(o)·n) algorithm. You can optimize n by increasing the batch size, etc., but there's an upper bound on this (and a bunch of other negative trade-offs). So the only other approach is to limit o, either by maintaining smaller clusters (which is recommended in the Borg paper) or by sampling, where you score a fixed-size sample of offers rather than the entire set.
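>     
>     As a minimal sketch of the sampling idea (hypothetical names, not the Aurora API):
>     
>         import java.util.Comparator;
>         import java.util.List;
>         import java.util.Optional;
>     
>         class SampledScorerSketch {
>           // Score only the first sampleSize matching offers instead of all of
>           // them, bounding the per-task scoring cost regardless of how many
>           // offers the cluster holds.
>           static <T> Optional<T> pickBest(List<T> matching, int sampleSize,
>                                           Comparator<T> score) {
>             return matching.stream().limit(sampleSize).min(score);
>           }
>         }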
>     
>     We're still considering both approaches to reducing the number of offers. But with the latter approach of scoring only a subset of matching offers, what we've run into is that with a random ordering of all offers (e.g. OfferManager.getOffers on master), the % improvement in bin-packing vs. first-fit and random ordering is completely negligible without blowing up performance by increasing the sample size. One of the big problems is that as soon as a task is scheduled, the previous filtering decisions on all offers are potentially invalidated by things like host and limit constraints. But when I added a CPU sort on the offers, the results were dramatically improved: only a couple of percentage points away from a perfect bin-packing simulation. So that's the main motivation for this patch.
>     
>     Unfortunately we're stuck in the short term on this single-writer single-reader architecture for Aurora, so we can't even fan out this workload by increasing the number of replicas, etc.
> 
> Stephan Erb wrote:
>     Thanks for all the details! Is the improvement you have stated in the description of the patch independent of the score-based approach? Or does it already require scoring?
>     
>     I am OK with getting this patch merged. We should however make sure that operators understand the potential downsides (e.g. constant rescheduling on broken nodes).
>     
>     I would love to see this working for revocable resources as well, but I can look into that in a follow-up patch :)

Yeah, the improvement was with this patch alone. 

I split resources into revocable/non-revocable when ordering and added a new REVOCABLE_CPU order. RAM/DISK are not shareable resources, so revocable and non-revocable seem to be treated the same.
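
To illustrate the split (a sketch only; the real enum and comparators are in OfferOrder.java and OfferOrderBuilder.java in the diff):

    import java.util.Comparator;

    class OfferOrderSketch {
      static final class Offer {
        final double cpus;           // non-revocable CPU in the offer
        final double revocableCpus;  // revocable CPU in the offer
        Offer(double cpus, double revocableCpus) {
          this.cpus = cpus;
          this.revocableCpus = revocableCpus;
        }
      }

      // A CPU order considers only non-revocable CPU ...
      static final Comparator<Offer> CPU =
          Comparator.comparingDouble((Offer o) -> o.cpus);

      // ... while a REVOCABLE_CPU order considers only revocable CPU, so each
      // task class is packed against the resource pool it actually consumes.
      static final Comparator<Offer> REVOCABLE_CPU =
          Comparator.comparingDouble((Offer o) -> o.revocableCpus);
    }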


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59480/#review176057
-----------------------------------------------------------


On May 31, 2017, 9:41 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59480/
> -----------------------------------------------------------
> 
> (Updated May 31, 2017, 9:41 p.m.)
> 
> 
> Review request for Aurora, Santhosh Kumar Shanmugham and Stephan Erb.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This patch enables scalable, high-performance Scheduler bin-packing using the existing
first-fit task assigner, and it can be controlled with a simple command line argument. 
> 
> The bin-packing is only an approximation, but it can lead to pretty significant improvements in resource utilization per agent. For example, on a CPU-bound cluster with 30k+ hosts and 135k tasks (across 1k+ jobs), we were able to reduce the number of hosts with tasks scheduled on them to just 90%, down from 99.7% (as one would expect from randomization). So if you are running Aurora on elastic compute and paying for machines by the minute/hour, then utilizing this patch _could_ allow you to reduce your server footprint by as much as 10%.
> 
> The approximation is based on the simple idea that you have the best chance of achieving perfect bin-packing if you put tasks in the smallest slot available. So if you have a task needing 8 cores and you have an 8-core and a 12-core offer available, you'd always want to put the task in the 8-core offer*. By sorting offers in OfferManager during iteration, a first-fit algorithm is guaranteed to match the smallest possible offer for your task, which achieves this.
> 
> * - The correct decision of course depends on the other pending tasks and the other resources
available, and more satisfactory results may also need preemption, etc.
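> 
> To make that concrete, a tiny sketch of first-fit over sorted offers (hypothetical types, not the actual TaskAssigner code):
> 
>     import java.util.List;
>     import java.util.Optional;
> 
>     class FirstFitSketch {
>       // Offers are pre-sorted ascending by CPU, so the first offer that fits
>       // is also the smallest that fits: an 8-core task against offers of
>       // [4, 8, 12] cores lands on the 8-core offer.
>       static Optional<Double> assign(List<Double> sortedOfferCpus, double taskCpus) {
>         return sortedOfferCpus.stream()
>             .filter(cpus -> cpus >= taskCpus)
>             .findFirst();
>       }
>     }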
> 
> 
> Diffs
> -----
> 
>   RELEASE-NOTES.md 75b3ddb856dc5d889a9006490f57cc58ee7d82fc 
>   src/jmh/java/org/apache/aurora/benchmark/SchedulingBenchmarks.java b933a5bbc2d1d5edb5e473135fb523a1fe02db35

>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 78255e6dfa31c4920afc0221ee60ec4f8c2a12c4

>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrder.java PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/offers/OfferOrderBuilder.java PRE-CREATION

>   src/main/java/org/apache/aurora/scheduler/offers/OfferSettings.java adf7f33e4a72d87c3624f84dfe4998e20dc75fdc

>   src/main/java/org/apache/aurora/scheduler/offers/OffersModule.java 317a2d26d8bfa27988c60a7706b9fb3aa9b4e2a2

>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java d7addc0effb60c196cf339081ad81de541d05385

>   src/test/java/org/apache/aurora/scheduler/resources/ResourceTestUtil.java 676d305d257585e53f0a05b359ba7eb11f1b23be

> 
> 
> Diff: https://reviews.apache.org/r/59480/diff/2/
> 
> 
> Testing
> -------
> 
> This has been scale-tested with production-like workloads and performs well, adding only a few extra seconds total in TaskAssigner when applied to thousands of tasks per minute.

> There is an overhead when scheduling tasks that have large resource requirements, as the task assigner will first need to skip over all the offers with low resources. In a packed cluster, this is where the extra seconds are spent. This could be reduced by jumping over all the offers we know to be too small, but that decision has to map to the OfferOrder (which adds complexity). That can be addressed in a follow-up review if needed.
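> 
> One possible shape for that follow-up (an assumption on my part, not part of this diff) would be to keep offers in a structure that can jump past the too-small prefix:
> 
>     import java.util.Collection;
>     import java.util.NavigableMap;
>     import java.util.TreeMap;
> 
>     class SkipAheadSketch {
>       // Offers keyed by available CPU (real code would need to handle several
>       // offers sharing the same key); tailMap() skips every offer we already
>       // know is too small instead of iterating and rejecting them one by one.
>       static final NavigableMap<Double, Collection<String>> OFFERS_BY_CPU =
>           new TreeMap<>();
> 
>       static Collection<Collection<String>> candidates(double taskCpus) {
>         return OFFERS_BY_CPU.tailMap(taskCpus, true).values();
>       }
>     }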
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>

