aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Maxim Khutornenko" <ma...@apache.org>
Subject Re: Review Request 41226: Handling task event race in updater.
Date Fri, 11 Dec 2015 19:59:31 GMT


> On Dec. 11, 2015, 7:07 p.m., Bill Farner wrote:
> > Can you take a stab at illustrating in the description how this problem manifests?
 The current shape of the patch feels like it papers over a problem that we might be left
to solve again, so thinking about the problem at a high level might help us think about a
broader fix.

I agree it may feel somewhat hacky but it's the easiest way to handle updater/event race without
adding more moving parts or excessive locking.

The core of the problem is (as mentioned in the description) in the way we process task events.
After moving into async processing, events can be (and are) delayed. It's a hard problem to
follow but I'll try to explain with an example of AURORA-1506:
- instance 1 fails while been watched by the updater, event F is dispatched (e.g.: RUNNING->FAILED)
- instance 1 is rescheduled and event R is dispatched (e.g.: INIT->THROTTLED)
- event F reached updater and updater decides to rollback, issues a kill for the instance
intending to return to previous config
- THROTTLED task is deleted from the store without firing a task change event (but firing
a task deleted event)
- event R is finally processed by async bus and updater is notified. Updater follows the evaluate
path and "thinks" a KILL is still required, which is legit but does not make sense as the
task is no longer in the store.

So, the crux of the problem is entering updater evaluator loop with a stale task info. Figuring
out if the task info is stale at the subscriber point is impossible as it requires updater
inner state. The best, as it seems to me, let the updater figure out what to do but prevent
the state changing action if it conflicts with the store. Sort of optimistic evaluation run.

Now to the question: The only downside of this race is a stack trace in the log. The actual
errors are handled by the `JobUpdateEventSubscriber` and do not fail the update. These errors
are polluting our logs and monitoring stats.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/41226/#review109984
-----------------------------------------------------------


On Dec. 11, 2015, 1:33 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/41226/
> -----------------------------------------------------------
> 
> (Updated Dec. 11, 2015, 1:33 a.m.)
> 
> 
> Review request for Aurora and Bill Farner.
> 
> 
> Bugs: AURORA-1506 and AURORA-1507
>     https://issues.apache.org/jira/browse/AURORA-1506
>     https://issues.apache.org/jira/browse/AURORA-1507
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Task event processing delays may generate duplicate evaluation results in the `InstanceActionHandler`.
This leads to occasional stack traces in the log but is generally benign as `JobUpdateEventSubscriber`
catches these errors. 
> 
> The easiest fix is to check for the task existence before attempting to kill/add instance.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/updater/InstanceActionHandler.java d8686f1f1b53e5ff2791663489e24c342503831e

>   src/test/java/org/apache/aurora/scheduler/updater/AddTaskTest.java 0583a63a175880355a7296ebd1c6e6fb5dc99f38

>   src/test/java/org/apache/aurora/scheduler/updater/KillTaskTest.java PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/41226/diff/
> 
> 
> Testing
> -------
> 
> ./gradlew -Pq build
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message