Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 23081200D37 for ; Thu, 9 Nov 2017 23:21:33 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 21721160BEF; Thu, 9 Nov 2017 22:21:33 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 407631609C8 for ; Thu, 9 Nov 2017 23:21:32 +0100 (CET) Received: (qmail 75758 invoked by uid 500); 9 Nov 2017 22:21:31 -0000 Mailing-List: contact reviews-help@aurora.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: reviews@aurora.apache.org Delivered-To: mailing list reviews@aurora.apache.org Received: (qmail 75747 invoked by uid 99); 9 Nov 2017 22:21:31 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 09 Nov 2017 22:21:31 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 58A3D180787; Thu, 9 Nov 2017 22:21:30 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 3.251 X-Spam-Level: *** X-Spam-Status: No, score=3.251 tagged_above=-999 required=6.31 tests=[HEADER_FROM_DIFFERENT_DOMAINS=0.001, HTML_MESSAGE=2, KAM_LAZY_DOMAIN_SECURITY=1, KAM_LOTSOFHASH=0.25, RP_MATCHES_RCVD=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id HjDr5nbNh42p; Thu, 9 Nov 2017 22:21:28 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 8265E5FAF9; Thu, 9 Nov 2017 22:21:27 +0000 (UTC) Received: from reviews.apache.org (unknown [10.41.0.12]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id D0D71E069F; Thu, 9 Nov 2017 22:21:26 +0000 (UTC) Received: from reviews-vm2.apache.org (localhost [IPv6:::1]) by reviews.apache.org (ASF Mail Server at reviews-vm2.apache.org) with ESMTP id C82F3C402DD; Thu, 9 Nov 2017 22:21:26 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============4176573642512722671==" MIME-Version: 1.0 Subject: Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs From: David McLaughlin To: Bill Farner , Santhosh Kumar Shanmugham , Jordan Ly Cc: Aurora , Stephan Erb , Aurora ReviewBot , David McLaughlin Date: Thu, 09 Nov 2017 22:21:26 -0000 Message-ID: <20171109222126.37792.82892@reviews-vm2.apache.org> X-ReviewBoard-URL: https://reviews.apache.org/ Auto-Submitted: auto-generated Sender: David McLaughlin X-ReviewGroup: Aurora X-Auto-Response-Suppress: DR, RN, OOF, AutoReply X-ReviewRequest-URL: https://reviews.apache.org/r/63536/ X-Sender: David McLaughlin References: <20171109172323.34251.25126@reviews-vm2.apache.org> In-Reply-To: <20171109172323.34251.25126@reviews-vm2.apache.org> X-ReviewBoard-Diff-For: src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java Reply-To: David McLaughlin X-ReviewRequest-Repository: aurora archived-at: Thu, 09 Nov 2017 22:21:33 -0000 --===============4176573642512722671== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit > On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote: > > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java > > Lines 76 (patched) > > > > > > Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping you perform a CAS. If we had a _true_ CAS operation, you could very safely avoid the need to cancel futures. You can't do that with the status only since the task can now cycle with `PARTITIONED`. > > > > What do you think about using the last transition timestamp to do this? > > > > ```java > > long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp(); > > executor.schedule(() -> { > > storage.write(... { > > if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) { > > // proceed > > } else { > > // cas failed, the task has since transitioned > > } > > } > > > > }, delay, TimeUnit.SECONDS); > > ``` > > David McLaughlin wrote: > The other motivation (other than correctness, which I'm pretty sure we now have) for removing and cancelling futures was to get the tasks off the heap. Of course, that's not actually happening here because I'm not using a ScheduledThreadPoolExecutor (and setting removeOnCancelPolicy to true). With the AsyncUtil suggestion above, I'll be sure to set that to make it clear what I'm doing. > > If we still think it's necessary to include this timestamp check, I'm happy to do that (although we'd have to refetch the task from storage inside the storage.write of course). > > Bill Farner wrote: > > get the tasks off the heap > > Smells like premature optimization. Either, you can certainly factor this so you only maintain a reference to the task ID and the timestamp. Isn't premature optimization usually brandished in response to complexity though? I don't see the complexity here. - David ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63536/#review190607 ----------------------------------------------------------- On Nov. 9, 2017, 12:48 a.m., David McLaughlin wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63536/ > ----------------------------------------------------------- > > (Updated Nov. 9, 2017, 12:48 a.m.) > > > Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner. > > > Repository: aurora > > > Description > ------- > > This is my prototype code for adding partition-awareness to Aurora. There is a proposal document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit# > > I'd like feedback on the high-level approach before adding unit tests, metrics, logging, etc. > > > Diffs > ----- > > api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9 > examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449 > src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0 > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java 5e83b2acdde7198d16427d4031e9772f78612554 > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION > src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9 > src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1 > src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1 > src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc > src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e > src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c > > > Diff: https://reviews.apache.org/r/63536/diff/2/ > > > Testing > ------- > > Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs: > > 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly to LOST) > 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before moving to LOST) > 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely) > > I also verified tasks are able to transition to various states (back to RUNNING or moving to FAILED, etc.) when you turn the agent back on. > > > File Attachments > ---------------- > > Task in PARTITIONED state > https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png > Task back into running when partition resolved > https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png > Compaction of PARTITIONED cycles (note timestamps) > https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png > > > Thanks, > > David McLaughlin > > --===============4176573642512722671==--