aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxim Khutornenko <ma...@apache.org>
Subject Re: Review Request 50168: Add rollback functionality to the scheduler
Date Fri, 29 Jul 2016 20:42:46 GMT


> On July 29, 2016, 7:25 p.m., Maxim Khutornenko wrote:
> > Finally had time to go over this review in detail. I am afraid the proposed approach
of rolling back the arbitrary update may not work in general case and result in unexpected
outcome, inconsistent job state and hard to troubleshoot update failures. Once the global
update lock is lifted (e.g. the update finishes), subsequent updates and/or kill/add instance
RPC calls will drift the job state making the history obsolete. 
> > 
> > For example, consider updates changing instance counts:
> > U1: 5 -> 10
> > U2: 10 -> 20 (or addInstances(0, 10))
> > 
> > If we now attempt to rollback U1 we will end up with a hole [5-9] in the instance
count: 0-4, 10-19 AND likely unexpected extra 10 instances left (10-19). 
> > 
> > Following the above example we can come up with plenty of other intricate and hard
to reason scenarios. By allowing randomly old update rollback we are giving our users a powerful
tool that is very hard to understand or ensure correctness. 
> > 
> > The only way this _may_ be improved if we approach rollbackJobUpdate RPC as the
one creating a completely new update request with the old initial state used as the new desired.
This way _all_ current job instances are being considered, job diff gives correct output and
quota checks are properly done. The latter btw is another problem as rollbackJobUpdate does
not call `validateTaskLimits` to ensure there is enough quota to rollback something like 10
-> 5. The open problem: in case of multiple initial states to rollback to, which one to
choose as the new desired one? We can only choose one according to our thrift schema (for
a good reason). 
> > 
> > We have thought long and hard about the rollback cases in our internal deployment
orchestration tool and ended up with the approach that operates at the .aurora config (JobUpdateRequest)
level by creating a _new_ update with the target job config (JobUpdateRequest) of the _old_
one. This way we can safely rollback to the previously deployed version without violating
job/cluster invariants. 
> > 
> > Given the concerns in this CR, I suggest we move this discussion to dev list or
a design doc.
> > 
> > To seed our further discussion: we can consider an approach where instead of rolling
back a job update itself we rollback _to_ that update's target state. For example (desired
state === desiredState + settings + instanceCount):
> > U1: desired state A
> > U2: desired state B
> > U3: desired state C
> > 
> > To rollback to state B: U4: use desired state of U2
> > To rollback to state A: U4: use desired state of U1
> > 
> > While not ideal (e.g. there is no way to rollback to state A-1), this would be safer
and easier to reason about. It'd also require a smaller diff as most of the required code
paths already exist.
> 
> Igor Morozov wrote:
>     I think the semantics of rollback is pretty well defined: it is to rollback a job(in
whatever state it is right now) to the state defined as initial in the update job. Should
not scheduler try to kill extra instances in your example as they violate the job configuration
it is rolling back to? In this example that simply means having just 5 instances running after
rollback.

> Should not scheduler try to kill extra instances in your example as they violate the
job configuration it is rolling back to?

Unfortunately, that's not possible in the current implementation. Once an update object is
created, the initial and desired states contain only the working instance set calculated by
the `JobDiff`. Any instances created/killed after the update finishes are no longer considered.
The working set can be further reduced within a given update if `updateOnlyTheseInstances`
value is specified. 

In other words, the updater goes over a narrowly defined set of instances completely ignoring
the state of the job after the update has started.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/50168/#review144137
-----------------------------------------------------------


On July 28, 2016, 10:54 p.m., Igor Morozov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/50168/
> -----------------------------------------------------------
> 
> (Updated July 28, 2016, 10:54 p.m.)
> 
> 
> Review request for Aurora, Joshua Cohen, Maxim Khutornenko, Bill Farner, and Zameer Manji.
> 
> 
> Bugs: AURORA-1721
>     https://issues.apache.org/jira/browse/AURORA-1721
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add rollback functionality to the scheduler
> 
> Thic change also includes an addition to storage replicated log. I needed a way to re-insert
lock token for an existing job update.
> 
> 
> Diffs
> -----
> 
>   CHANGELOG fc6a46d77ebf889c8f60402c95e8a7472980ba1c 
>   RELEASE-NOTES.md 29d224d5596eb060356f1343cf347db79b1ad687 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d66208490aff6ea8af4c737845fa2cf13617529

>   api/src/main/thrift/org/apache/aurora/gen/storage.thrift 9e4213f13255a182df938bea44ca87fa03a25318

>   src/main/java/org/apache/aurora/scheduler/storage/JobUpdateStore.java 52c3c6618a3cf1009435ca8a9cece36365913e55

>   src/main/java/org/apache/aurora/scheduler/storage/db/DbJobUpdateStore.java d2673e6b328cb1e249fbe91d18e0d9e935636eaa

>   src/main/java/org/apache/aurora/scheduler/storage/log/LogStorage.java 39924c62108f6a68f545f90d77ceab4265d41fad

>   src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java d0de063fd78e6c4f62fae4a598d1d22f9775772d

>   src/main/java/org/apache/aurora/scheduler/thrift/SchedulerThriftInterface.java b534abf95bab6e1657e3ef993cf34c0d6ec460be

>   src/main/java/org/apache/aurora/scheduler/thrift/aop/AnnotatedAuroraAdmin.java 9243c92b11040b68ed6014b3991db69fc08bcddf

>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateController.java f8357c46df1b025bf4e38a7ce1cb1c13a50c39f9

>   src/main/java/org/apache/aurora/scheduler/updater/JobUpdateControllerImpl.java 364c5c753f884a2d89e27802d7bbf3b2b6d3a08e

>   src/main/python/apache/aurora/client/api/__init__.py 68baf8fdb90cd26100159401c46c9963c24332b3

>   src/main/python/apache/aurora/client/cli/update.py bb526f7bf94d7bfe02fe2786493c85be1bfeb86f

>   src/test/java/org/apache/aurora/scheduler/updater/JobUpdaterIT.java e157c0dfde5efc418448e138aa008ade742fe816

>   src/test/python/apache/aurora/client/api/test_scheduler_client.py afbd385b7eda64cb1f7d118b695e65e4045eac6c

> 
> Diff: https://reviews.apache.org/r/50168/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Igor Morozov
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message