aurora-reviews mailing list archives

From Aurora ReviewBot <wfar...@apache.org>
Subject Re: Review Request 59853: Process rescinds in the same thread pool as offers.
Date Tue, 06 Jun 2017 19:59:11 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/#review177079
-----------------------------------------------------------


Ship it!




Master (2cbaeec) is green with this patch.
  ./build-support/jenkins/build.sh

I will refresh this build result if you post a review containing "@ReviewBot retry"

- Aurora ReviewBot


On June 6, 2017, 7:42 p.m., Zameer Manji wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59853/
> -----------------------------------------------------------
> 
> (Updated June 6, 2017, 7:42 p.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.
> 
> 
> Bugs: AURORA-1933
>     https://issues.apache.org/jira/browse/AURORA-1933
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> In a production environment I was able to observe the following:
> ```
> I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
> I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
> ```
> 
> Notice that the offer rescind was processed before the actual offer. This is
> possible because there is a race in the `MesosCallbackHandlerImpl`. The offer is
> processed in the executor (to prevent blocking) and the rescind is handled
> directly. This means the offer processing thread (`SchedulerImpl-0`) is racing
> against the callback thread (`Thread-77638`).
> 
> In normal operation, there will be seconds to minutes between an offer and its
> rescind, but in clusters that use oversubscription modules an offer can be
> rescinded very quickly.
> 
> To fix this, we move the rescind processing into the same executor as the offer
> processing to ensure they are processed in the order they are received. Without
> fixing this, the rescinded offer exists in the offer manager and can be used
> later to launch a task. This task will immediately fail to launch because the
> offer is invalid.
> 
> In this patch, I have also added a metric and logging to record when we fail to
> remove an offer from the offer manager, and cleaned up the logging to allow
> operators to see when an offer was received. With this logging, an operator can
> grep for the offer id and see the entire lifecycle of the offer in the
> scheduler.
> 
> 
> Diffs
> -----
> 
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c
>   src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884
> 
> 
> Diff: https://reviews.apache.org/r/59853/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Zameer Manji
> 
>
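
For readers following the quoted description, here is a minimal sketch of the
ordering idea: both callbacks are submitted to one executor, so a rescind cannot
be processed before the offer that was submitted ahead of it. This is not the
patch itself. The class `OrderedCallbackSketch`, its methods, the in-memory
offer map, and the miss counter are all hypothetical names for illustration, and
the sketch assumes the executor is single-threaded (the `SchedulerImpl-0` thread
name hints at that, but the message does not state it).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a Mesos callback handler; not Aurora's actual code.
final class OrderedCallbackSketch {
  // Hypothetical offer store: offer id -> placeholder payload.
  private final Map<String, String> offers = new ConcurrentHashMap<>();
  // Hypothetical metric: rescinds that found no matching offer to remove.
  private final AtomicLong rescindMisses = new AtomicLong();
  // Assumption: a single worker thread, so tasks run in submission order.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();

  // Offers are handed to the executor so the callback thread is not blocked.
  public void handleOffer(String offerId) {
    executor.execute(() -> {
      System.out.println("Received offer: " + offerId);
      offers.put(offerId, "resources");
    });
  }

  // Rescinds go through the same executor instead of running directly on the
  // callback thread, so they are processed after any earlier offer.
  public void handleRescind(String offerId) {
    executor.execute(() -> {
      System.out.println("Offer rescinded: " + offerId);
      if (offers.remove(offerId) == null) {
        // Record the failure so an operator can spot it.
        rescindMisses.incrementAndGet();
        System.out.println("Failed to remove offer: " + offerId);
      }
    });
  }

  // Stops accepting work and waits for queued callbacks to finish.
  public boolean drain() {
    executor.shutdown();
    try {
      return executor.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public int offerCount() {
    return offers.size();
  }

  public long rescindMisses() {
    return rescindMisses.get();
  }

  public static void main(String[] args) {
    OrderedCallbackSketch handler = new OrderedCallbackSketch();
    handler.handleOffer("offer-1");
    handler.handleRescind("offer-1");
    handler.drain();
    System.out.println("Offers still held: " + handler.offerCount());
  }
}
```

If `handleRescind` instead ran its body directly on the calling thread, the
rescind could run before the queued offer task, leaving a stale offer in the
map, which is the behaviour the quoted log lines show.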

