Subject: Re: Review Request 61804: Fix race condition where rescinds are received but not processed before offer is accepted
From: Zameer Manji
To: Santhosh Kumar Shanmugham, David McLaughlin, Stephan Erb, Zameer Manji
Cc: Aurora, Jordan Ly
Date: Tue, 22 Aug 2017 03:21:15 -0000
Message-ID: <20170822032115.3008.59826@reviews-vm2.apache.org>
References: <20170822003826.3008.11966@reviews-vm2.apache.org>
In-Reply-To: <20170822003826.3008.11966@reviews-vm2.apache.org>
Mailing-List: contact reviews-help@aurora.apache.org; run by ezmlm
X-ReviewRequest-URL: https://reviews.apache.org/r/61804/
X-ReviewRequest-Repository: aurora
Content-Type: multipart/alternative; boundary="===============3243035808688849767=="
MIME-Version: 1.0

--===============3243035808688849767==
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61804/#review183436
-----------------------------------------------------------

I don't have the bandwidth to review this, but this fix seems to be a clever way of fixing this problem and far better than my previous approach.

- Zameer Manji


On Aug. 21, 2017, 5:38 p.m., Jordan Ly wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61804/
> -----------------------------------------------------------
>
> (Updated Aug. 21, 2017, 5:38 p.m.)
>
>
> Review request for Aurora, David McLaughlin, Santhosh Kumar Shanmugham, Stephan Erb, and Zameer Manji.
>
>
> Bugs: AURORA-1945
>     https://issues.apache.org/jira/browse/AURORA-1945
>
>
> Repository: aurora
>
>
> Description
> -------
>
> The following race condition for offers is currently possible:
> ```
> 1. Scheduler receives an offer and adds it to the executor queue for processing.
> 2. The executor processes the offer and adds it to the HostOffers list.
> 3. Scheduler receives a rescind for that offer and adds it to the executor queue for processing. However, if the executor is under heavy load, there can be a delay between receiving the rescind and processing it.
> 4. Scheduler accepts the offer before the rescind is processed by the executor. This launches a task with an invalid offer, which leads to TASK_LOST.
> ```
> The following logs show this in action:
>
> Mesos:
> ```
> I0810 14:33:45.744372 19274 master.cpp:6065] Removing offer OFFER_X with revocable resources...
> W0810 14:34:23.640905 19279 master.cpp:3696] Ignoring accept of offer OFFER_X since it is no longer valid
> W0810 14:34:23.640923 19279 master.cpp:3709] ACCEPT call used invalid offers '[ OFFER_X ]': Offer OFFER_X is no longer valid
> I0810 14:34:23.640974 19279 master.cpp:6253] Sending status update TASK_LOST for task TASK_Y with invalid offers: Offer OFFER_X is no longer valid'
> ```
> Aurora:
> ```
> I0810 14:28:45.676 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Received offer: OFFER_X
> I0810 14:34:23.635 [TaskGroupBatchWorker, VersionedSchedulerDriverService] Accepting offer OFFER_X with ops [LAUNCH]
> I0810 14:34:24.186 [Thread-4471585, MesosCallbackHandler$MesosCallbackHandlerImpl] Received status update for task TASK_Y in state TASK_LOST from SOURCE_MASTER with REASON_INVALID_OFFERS: Task launched with invalid offers: Offer_X is no longer valid
> I0810 14:34:32.972 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl] Offer rescinded: OFFER_X
> W0810 14:34:32.972 [SchedulerImpl-0, OfferManager$OfferManagerImpl] Failed to cancel offer: OFFER_X.
> ```
> I would like to temporarily ban an offer if we receive a rescind for it before the offer has been added (i.e. while it is still in the executor queue). Then, when we actually process the offer, we will not assign it to tasks, since we know it has already been rescinded. When we ban the offer, we will also add a command to the executor queue to unban it, so that future offers are not affected. This solution should also avoid the race condition fixed in: https://issues.apache.org/jira/browse/AURORA-1933
>
>
> Diffs
> -----
>
>   src/jmh/java/org/apache/aurora/benchmark/fakes/FakeOfferManager.java 6f2ca35c5d83dde29c24865b4826d4932e96da80
>   src/main/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandler.java 2a42cac651729b8edec839c86ce406f76b17f810
>   src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java a55f8add763f1d5ffbd964afd6e4615ff0021ea5
>   src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java 25399e4a4b8f290065eacaf1e3ec1a36c131266b
>   src/test/java/org/apache/aurora/scheduler/mesos/MesosCallbackHandlerTest.java b5fa1c87e367e65d96d5a8eb0c9f43fd10d08d3e
>   src/test/java/org/apache/aurora/scheduler/offers/OfferManagerImplTest.java be02449eee97643b258792127521445a2c7fc0d3
>   src/test/java/org/apache/aurora/scheduler/state/FirstFitTaskAssignerTest.java 25c1137920553774c32047088ace34279a71bbda
>
>
> Diff: https://reviews.apache.org/r/61804/diff/2/
>
>
> Testing
> -------
>
> `./gradlew test`
>
> Ran `./src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh` successfully.
>
> I will verify this patch on a live cluster as well before submitting.
>
>
> Thanks,
>
> Jordan Ly
>
>

--===============3243035808688849767==--
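
Editor's note: the ban/unban idea in the description can be illustrated with a small, single-threaded simulation. This is only a sketch under stated assumptions, not the code in the diff: the class `OfferBanSketch` and all of its methods are hypothetical names, the executor is modelled as a plain queue that runs only when `drain()` is called, and offers are reduced to string IDs. The point it tries to show is that the ban takes effect immediately when the rescind is received, while the cancel and unban are deferred to the executor queue.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are taken from the Aurora code base.
class OfferBanSketch {
  // Stand-in for the executor queue: commands run only when drain() is called.
  private final Queue<Runnable> executorQueue = new ArrayDeque<>();
  // Stand-in for the HostOffers list.
  private final Set<String> hostOffers = new LinkedHashSet<>();
  // Offers that were rescinded but whose rescind has not been processed yet.
  private final Set<String> bannedOffers = ConcurrentHashMap.newKeySet();

  // Steps 1 and 2: the offer is queued and added to hostOffers when processed.
  void offerReceived(String offerId) {
    executorQueue.add(() -> hostOffers.add(offerId));
  }

  // Step 3 with the proposed fix: ban right away, then queue cancel + unban.
  void offerRescinded(String offerId) {
    bannedOffers.add(offerId);
    executorQueue.add(() -> {
      hostOffers.remove(offerId);
      bannedOffers.remove(offerId);
    });
  }

  // Step 4: a banned offer is never accepted, even if it is still listed.
  boolean tryAccept(String offerId) {
    if (bannedOffers.contains(offerId)) {
      return false;
    }
    return hostOffers.remove(offerId);
  }

  boolean isBanned(String offerId) {
    return bannedOffers.contains(offerId);
  }

  // Runs every queued command in order, like an executor catching up on load.
  void drain() {
    Runnable command;
    while ((command = executorQueue.poll()) != null) {
      command.run();
    }
  }

  public static void main(String[] args) {
    OfferBanSketch scheduler = new OfferBanSketch();
    scheduler.offerReceived("OFFER_X");
    scheduler.drain();                   // offer is now in hostOffers
    scheduler.offerRescinded("OFFER_X"); // rescind queued, not yet processed
    System.out.println(scheduler.tryAccept("OFFER_X")); // false
    scheduler.drain();                   // rescind processed, ban lifted
    System.out.println(scheduler.isBanned("OFFER_X"));  // false
  }
}
```

Because the unban is queued behind the cancel, the ban lasts exactly as long as the rescind is waiting in the queue, so the banned set does not grow without bound. A real scheduler would also need to make the offer set safe for concurrent access, which this sketch leaves out.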