aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Maxim Khutornenko <ma...@apache.org>
Subject Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.
Date Thu, 22 Sep 2016 15:19:46 GMT


> On Sept. 22, 2016, 2:25 p.m., Zameer Manji wrote:
> > src/main/java/org/apache/aurora/scheduler/BatchWorker.java, line 152
> > <https://reviews.apache.org/r/52141/diff/1/?file=1507901#file1507901line152>
> >
> >     Something is wrong here.
> >     
> >     The listener from `GuavaUtils.LifecycleShutdownListener` should have already
been applied to this service in `SchedulerServicesModule`. We used `addSchedulerActiveServiceBinding`
to add the BatchWorker services so I'm not sure why this helps at all.

Ah, that's right, that should be enough. So, the fix is even simpler, just let the error bubble
up and the existing shutdown hook will take care of the rest. Verified in vagrant that's exactly
what happens.


- Maxim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150000
-----------------------------------------------------------


On Sept. 22, 2016, 5:38 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2016, 5:38 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
>     https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> ######Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of the individual
batch items fail. This may not hold true if the storage is modified _before_ the error is
thrown and the `LogStorage` transaction is not rolled back. Consider the following fragment
from the `StateManagerImpl`:
> ```
>     storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
>     for (IScheduledTask scheduledTask : scheduledTasks) {
>       updateTaskAndExternalState(
>           storeProvider.getUnsafeTaskStore(),
>           Tasks.id(scheduledTask),
>           Optional.of(scheduledTask),
>           Optional.of(PENDING),
>           Optional.absent());
>     }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws AND `BatchWorker`
catches and logs that exception, the native store transaction will be committed and as a result
the storage will be left in a logically inconsistent state.
> 
> ######Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the failed item
would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming batch items
are idempotent, which may not be the case). The #2 may result in a very hard to troubleshoot
scenario where _some_ items would be dropped on the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. It's also
the only one that ensures storage state consistency, which is the most vital invariant to
preserve (as AURORA-1779 proves). The only downside of this approach is that scheduler will
go down hard any time an unhandled error is thrown but arguably this is the easiest way to
improve our codebase and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java e05d4b48749b0691902a505bea1b4490fdd08f29

>   src/main/java/org/apache/aurora/scheduler/SchedulerModule.java 2ec3967ddb1d470cf681de062a6400f647978185

>   src/main/java/org/apache/aurora/scheduler/cron/quartz/AuroraCronJob.java 7c8047a7235a937c29fe96517242923ff84a080c

>   src/main/java/org/apache/aurora/scheduler/scheduling/TaskGroups.java d390c07522d22e43d79ce4370985f3643ef021ca

>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java a86dc82cafa7f5436d2b8d00c6db575ff8311eea

>   src/test/java/org/apache/aurora/scheduler/cron/quartz/CronIT.java 8556253fc11f6027316871eb9dc66d8627a77fe6

> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> -------
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message