aurora-reviews mailing list archives

From Zameer Manji <zma...@apache.org>
Subject Re: Review Request 59853: Process rescinds in the same thread pool as offers.
Date Tue, 06 Jun 2017 20:56:19 GMT

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59853/
-----------------------------------------------------------

(Updated June 6, 2017, 1:56 p.m.)


Review request for Aurora, David McLaughlin and Santhosh Kumar Shanmugham.


Bugs: AURORA-1933
    https://issues.apache.org/jira/browse/AURORA-1933


Repository: aurora


Description (updated)
-------

In a production environment I was able to observe the following:
```
I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
```

Notice that the offer rescind was processed before the offer itself. This is
possible because there is a race in `MesosCallbackHandlerImpl`. The offer is
handed off to the executor (to avoid blocking the callback thread), while the
rescind is handled directly on the callback thread. This means the offer
processing thread (`SchedulerImpl-0`) is racing against the callback thread
(`Thread-77638`).

In normal operation, there will be seconds to minutes between an offer and its
rescind, but in clusters that use oversubscription modules an offer can be
rescinded very quickly after it is sent.

To fix this, we move the rescind processing into the same executor as the offer
processing, so that both are processed in the order they are received. Without
this fix, the rescind finds nothing to remove, the offer is then added to the
offer manager, and it can later be used to launch a task. That task will
immediately fail to launch because the offer is invalid.
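
A minimal sketch of the ordering idea, not the code in this patch: it assumes
the shared executor is single-threaded and runs tasks in submission order. The
class `OrderedCallbackSketch` and its methods are hypothetical stand-ins for
the callback handler and the offer manager.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the callback handler; not Aurora's actual class.
class OrderedCallbackSketch {
  // Assumption: one worker thread, so tasks run in the order they were submitted.
  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  // Stand-in for the offer manager's set of live offers.
  private final Set<String> offers = Collections.synchronizedSet(new HashSet<>());
  // Records the order in which callbacks were actually processed.
  private final List<String> events = Collections.synchronizedList(new ArrayList<>());

  // Offers are handed to the executor so the callback thread is not blocked.
  void handleOffer(String offerId) {
    executor.execute(() -> {
      offers.add(offerId);
      events.add("offer:" + offerId);
    });
  }

  // The rescind goes through the same executor instead of running directly on
  // the callback thread, so it cannot overtake the offer it refers to.
  void handleRescind(String offerId) {
    executor.execute(() -> {
      offers.remove(offerId);
      events.add("rescind:" + offerId);
    });
  }

  // Waits for all queued callbacks to finish and returns the remaining offers.
  Set<String> drainAndGetOffers() {
    executor.shutdown();
    try {
      if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("callbacks did not finish in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return new HashSet<>(offers);
  }

  List<String> events() {
    return new ArrayList<>(events);
  }

  public static void main(String[] args) {
    OrderedCallbackSketch handler = new OrderedCallbackSketch();
    handler.handleOffer("O1");
    handler.handleRescind("O1");
    System.out.println(handler.drainAndGetOffers());
  }
}
```

With a multi-threaded pool the two tasks could still run out of order, so the
guarantee in this sketch comes from the single worker thread, not from sharing
an executor as such.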

In this patch, I have also added a metric and logging to record when we fail to
remove an offer from the offer manager, and cleaned up the logging to allow
operators to see when an offer was received. With this logging, an operator can
grep for the offer id and see the entire lifecycle of the offer in the
scheduler.
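
As a rough illustration of the metric and logging described above (again a
sketch, not the patch itself): the class `OfferStoreSketch`, the counter, and
the log messages are all hypothetical, and a plain `AtomicLong` stands in for
whatever stats mechanism the scheduler uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Hypothetical stand-in for an offer store; not Aurora's OfferManager.
class OfferStoreSketch {
  private static final Logger LOG = Logger.getLogger(OfferStoreSketch.class.getName());

  // Live offers keyed by offer id; the value is an illustrative host name.
  private final Map<String, String> offersById = new ConcurrentHashMap<>();
  // Stand-in metric: counts rescinds that found no matching offer to remove.
  private final AtomicLong rescindMisses = new AtomicLong();

  void addOffer(String offerId, String host) {
    // Logging the id on receipt lets an operator grep for the offer's lifecycle.
    LOG.info("Received offer: " + offerId);
    offersById.put(offerId, host);
  }

  // Returns true if the offer was present and has been removed.
  boolean cancelOffer(String offerId) {
    boolean removed = offersById.remove(offerId) != null;
    if (removed) {
      LOG.info("Removed rescinded offer: " + offerId);
    } else {
      // A miss suggests the rescind arrived for an offer that is not stored.
      rescindMisses.incrementAndGet();
      LOG.warning("Failed to remove rescinded offer: " + offerId);
    }
    return removed;
  }

  long getRescindMisses() {
    return rescindMisses.get();
  }
}
```

A non-zero miss count in such a scheme would be the signal to look for rescinds
that could not be matched to a stored offer.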


Diffs
-----

  src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 5a5281aeaea1e2a4e0eab67069605838ee809c6c

  src/main/java/org/apache/aurora/scheduler/mesos/VersionedSchedulerDriverService.java 5e86504c70083065278864e6ab1cc85c83a45a28

  src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java 17e577b069df9232d57cde171a078d9f6db707ea

  src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java 97febf25cea2024e0ca43366b3d4578e67734884



Diff: https://reviews.apache.org/r/59853/diff/1/


Testing
-------


Thanks,

Zameer Manji

