aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zameer Manji <zma...@apache.org>
Subject Re: Review Request 52141: Shutting down scheduler on unhandled BatchWorker error.
Date Thu, 22 Sep 2016 17:45:34 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/52141/#review150035
-----------------------------------------------------------


Ship it!





src/main/java/org/apache/aurora/scheduler/BatchWorker.java (line 195)
<https://reviews.apache.org/r/52141/#comment217850>

    For posterity,
    
    This is important because the Guava docs says that `run()` should respond to shutdown
requests by checking `isRunning()`. If we block indefinately we technically don't respond
to shutdown requests.


- Zameer Manji


On Sept. 22, 2016, 8:20 a.m., Maxim Khutornenko wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/52141/
> -----------------------------------------------------------
> 
> (Updated Sept. 22, 2016, 8:20 a.m.)
> 
> 
> Review request for Aurora, Stephan Erb and Zameer Manji.
> 
> 
> Bugs: AURORA-1779
>     https://issues.apache.org/jira/browse/AURORA-1779
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> ######Problem
> 
> The current `BatchWorker` error handling assumes graceful recovery if any of the individual
batch items fail. This may not hold true if the storage is modified _before_ the error is
thrown and the `LogStorage` transaction is not rolled back. Consider the following fragment
from the `StateManagerImpl`:
> ```
>     storeProvider.getUnsafeTaskStore().saveTasks(scheduledTasks);
> 
>     for (IScheduledTask scheduledTask : scheduledTasks) {
>       updateTaskAndExternalState(
>           storeProvider.getUnsafeTaskStore(),
>           Tasks.id(scheduledTask),
>           Optional.of(scheduledTask),
>           Optional.of(PENDING),
>           Optional.absent());
>     }
> ```
> If a task is saved but the subsequent `updateTaskAndExternalState()` throws AND `BatchWorker`
catches and logs that exception, the native store transaction will be committed and as a result
the storage will be left in a logically inconsistent state.
> 
> ######Solution
> 
> I can see at least 3 ways to solve this problem:
> 1. Fail hard and shutdown scheduler when any of the batch items throw;
> 2. Fail only the last batch (drop all its items) and let storage transaction roll back;
> 3. Implement `BatchWorker` transaction where items _before_ and _after_ the failed item
would be retried when the storage transaction rolls back;
> 
> The #3 is the most involved and would be very hard to get right (assuming batch items
are idempotent, which may not be the case). The #2 may result in a very hard to troubleshoot
scenario where _some_ items would be dropped on the floor and never completed. 
> 
> I suggest we go with #1 as the most straightforward and transparent approach. It's also
the only one that ensures storage state consistency, which is the most vital invariant to
preserve (as AURORA-1779 proves). The only downside of this approach is that scheduler will
go down hard any time an unhandled error is thrown but arguably this is the easiest way to
improve our codebase and certainly better than leaving the scheduler in a crippled state.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/BatchWorker.java e05d4b48749b0691902a505bea1b4490fdd08f29

>   src/test/java/org/apache/aurora/scheduler/BatchWorkerTest.java a86dc82cafa7f5436d2b8d00c6db575ff8311eea

> 
> Diff: https://reviews.apache.org/r/52141/diff/
> 
> 
> Testing
> -------
> 
> unit and e2e tests
> 
> 
> Thanks,
> 
> Maxim Khutornenko
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message