aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxim Khutornenko <>
Subject Re: Review Request 51759: Batching writes - Part 1 (of 3): Introducing BatchWorker and task event batching.
Date Wed, 14 Sep 2016 22:34:23 GMT

> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > This patch LGTM. Just a single concern here and some questions. I will be moving
on to the subsequent patches shortly.
> > Please do not commit this until all parts are shipped. Also, please don't forget
to rebase/update subsequent patches if changes are made here.
> > 
> > Also, does this cover all state change consumers that interact on storage when the
event is fired or is this a subset? Just wondering if this is supposed to apply to all of
them or just a small set that you think need this change.

> Also, does this cover all state change consumers that interact on storage when the event
is fired or is this a subset

This covers all known high frequency consumers that require a write transaction. The list
is derived by running a stack-dumping version of the `CallOrderEnforcingStorage.write()` in
a live cluster for awhile and analysing top offenders. It may be not fully comprehensive though,
so if you find any other candidates feel free to report them.

> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/, line 233
> > <>
> >
> >     Instead of having an `Optional<BackoffStrategy>` and then later checking
it's present when we compute the backoff, we could simplify the internal implementation by
having non repeatable work items default to some sort of `NoopBackoffStrategy` instead.

I don't really see how that would be a better solution in this particular case. We do need
a `BackoffStrategy` in the repeatable work case it better be non-fake (and hard fail if not
provided). Having a noop strategy softens that constraint.

> On Sept. 14, 2016, 6:36 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/, line 69
> > <>
> >
> >     Just curious, is this default arbitrary or something that was derrived from
your deployment in production?

It's semi-arbitrary but proven in production :).

- Maxim

This is an automatically generated e-mail. To reply, visit:

On Sept. 14, 2016, 5:25 p.m., Maxim Khutornenko wrote:
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> -----------------------------------------------------------
> (Updated Sept. 14, 2016, 5:25 p.m.)
> Review request for Aurora, Joshua Cohen, Stephan Erb, and Zameer Manji.
> Repository: aurora
> Description
> -------
> This is the first (out of 3) patches intending to reduce storage write lock contention
and as such improve overall system write throughput. It introduces the `BatchWorker` and migrates
the majority of storage writes due to task status change events to use `TaskEventBatchWorker`.
> #####Problem
> Our current storage system writes effectively behave as `SERIALIZABLE` transaction isolation
level in SQL terms. This means all writes require exclusive access to the storage and no two
transactions can happen in parallel [1]. While it certainly simplifies our implementation,
it creates a single hotspot where multiple threads are competing for the storage write access.
This type of contention only worsens as the cluster size grows, more tasks are scheduled,
more status updates are processed, more subscribers are listening to status updates and etc.
Eventually, the scheduler throughput (and especially task scheduling) becomes degraded to
the extent that certain operations wait much longer (4x and more) for the lock acquisition
than it takes to process their payload when inside the transaction. Some ops (like event processing)
are generally tolerant of these types of delays. Others - not as much. The task scheduling
suffers the most as backing up the scheduling queue directly affects
  the Median Time To Assigned (MTTA).
> #####Remediation
> Given the above, it's natural to assume that reducing the number of write transactions
should help reducing the lock contention. This patch introduces a generic `BatchWorker` service
that delivers a "best effort" batching approach by redirecting multiple individual write requests
into a single FIFO queue served non-stop by a single dedicated thread. Every batch shares
a single write transaction thus reducing the number of potential write lock requests. To minimize
wait-in-queue time, items are dispatched immediately and the max number of items is bounded.
There are a few `BatchWorker` instances specialized on particular workload types: task even
processing, cron scheduling and task scheduling. Every instance can be tuned independently
(max batch size) and provides specialized metrics helping to monitor each workload type perf.
> #####Results
> The proposed approach has been heavily tested in production and delivered the best results.
The lock contention latencies got down between 2x and 5x depending on the cluster load. A
number of other approaches tried but discarded as not performing well or even performing much
worse than the current master:
> - Clock-driven batch execution - every batch is dispatched on a time schedule
> - Max batch with a deadline - a batch is dispatched when max size is reached OR a timeout
> - Various combinations of the above - some `BatchWorkers` are using clock-driven execution
while others are using max batch with a deadline
> - Completely non-blocking (event-based) completion notification - all call sites are
notified of item completion via a `BatchWorkCompleted` event
> Happy to provide more details on the above if interested.
> #####Upcoming
> The introduction of the `BatchWorker` by itself was not enough to substantially improve
the MTTA. It, however, paves the way for the next phase of scheduling perf improvement - taking
more than 1 task from a given `TaskGroup` in a single scheduling round (coming soon). That
improvement wouldn't deliver without decreasing the lock contention first. 
> Note: it wasn't easy to have a clean diff split, so some functionality in `BatchWorker`
(e.g.: `executeWithReplay`) appears to be unused in the current patch but will become obvious
in the part 2 (coming out shortly).  
> [1] -
> Diffs
> -----
>   src/main/java/org/apache/aurora/scheduler/ PRE-CREATION 
>   src/main/java/org/apache/aurora/scheduler/ 4a7ef0b1b90607f68d89fe8e207f42c42a8c56a0

>   src/main/java/org/apache/aurora/scheduler/pruning/ f07746c2b990c1c2235e99f9e4775fc84f9c27b1

>   src/main/java/org/apache/aurora/scheduler/scheduling/ bbd971a2aa8a96cf79edd879ad60e1bebd933d79

>   src/main/java/org/apache/aurora/scheduler/state/ 3c7cda09ab292d696070ca4d9dfedb1f6f71b0fe

>   src/main/java/org/apache/aurora/scheduler/updater/ 594bb6219294dcc77d48dcad14e2a6f9caa0c534

>   src/test/java/org/apache/aurora/scheduler/ PRE-CREATION 
>   src/test/java/org/apache/aurora/scheduler/pruning/ 99c27e8012f10a67ce5f1b84d258e7a5608995c7

>   src/test/java/org/apache/aurora/scheduler/scheduling/ 7d104aa2ea4a4d99be4711f666d18beca238284e

>   src/test/java/org/apache/aurora/scheduler/state/
>   src/test/java/org/apache/aurora/scheduler/testing/ PRE-CREATION

>   src/test/java/org/apache/aurora/scheduler/updater/ 196df4754b553f05e50b66ad2f84271901bc9eba

> Diff:
> Testing
> -------
> All types of testing including deploying to test and production clusters.
> Thanks,
> Maxim Khutornenko

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message