aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David McLaughlin <da...@dmclaughlin.com>
Subject Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs
Date Tue, 14 Nov 2017 22:52:22 GMT


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > api/src/main/thrift/org/apache/aurora/gen/api.thrift
> > Lines 526 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885203#file1885203line526>
> >
> >     It seems logical that this accounting would live alongside `ScheduledTask.failureCount`.
> >     
> >     How would you feel about naming this `timesPartitioned` instead?  I feel that
more clearly disambiguates from overloaded meanings of `partition`, e.g. https://en.wikipedia.org/wiki/Partition_(database)

Thanks for this suggestion, I agree it's much clearer.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/base/Conversions.java
> > Lines 71-72 (original), 71-72 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885205#file1885205line71>
> >
> >     Comment is now stale.

Removed.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
> > Lines 100 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885206#file1885206line100>
> >
> >     How about `-partition_aware` instead?  I find the `enable` qualifier unnecessary.

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 31 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line31>
> >
> >     How about s/futureMap/delayedTransitions/

Map has been removed.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 40 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line40>
> >
> >     Please use `AsyncUtil.singleThreadLoggingScheduledExecutor()` instead.  In addition
to automatically logging and tracking exceptions, it will create a daemon thread.

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 48 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line48>
> >
> >     Consider using `Tasks.getLatestEvent(task).getTimestamp()`
> >     
> >     to hide the need for `Iterables.getLast()`

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 53-54 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line53>
> >
> >     `String taskId = Tasks.id(stateChange.getTask());`

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 76 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885207#file1885207line76>
> >
> >     Thinking out loud - the `Optional.of(PARTITIONED)` parameter is already helping
you perform a CAS.  If we had a _true_ CAS operation, you could very safely avoid the need
to cancel futures.  You can't do that with the status only since the task can now cycle with
`PARTITIONED`.
> >     
> >     What do you think about using the last transition timestamp to do this?
> >     
> >     ```java
> >     long casLastTransitionMs = Tasks.getLatestEvent(task).getTimestamp();
> >     executor.schedule(() -> {
> >       storage.write(... {
> >         if (casLastTransition == Tasks.getLatestEvent(task).getTimestamp()) {
> >           // proceed
> >         } else {
> >           // cas failed, the task has since transitioned
> >         }
> >       }
> >     
> >     }, delay, TimeUnit.SECONDS);
> >     ```
> 
> David McLaughlin wrote:
>     The other motivation (other than correctness, which I'm pretty sure we now have)
for removing and cancelling futures was to get the tasks off the heap. Of course, that's not
actually happening here because I'm not using a ScheduledThreadPoolExecutor (and setting removeOnCancelPolicy
to true). With the AsyncUtil suggestion above, I'll be sure to set that to make it clear what
I'm doing. 
>     
>     If we still think it's necessary to include this timestamp check, I'm happy to do
that (although we'd have to refetch the task from storage inside the storage.write of course).
> 
> Bill Farner wrote:
>     > get the tasks off the heap
>     
>     Smells like premature optimization.  Either, you can certainly factor this so you
only maintain a reference to the task ID and the timestamp.
> 
> David McLaughlin wrote:
>     Isn't premature optimization usually brandished in response to complexity though?
I don't see the complexity here.
> 
> Bill Farner wrote:
>     I'm simply pointing out that we shouldn't maintain the map if the primary motivation
is to reduce heap consumption.
>     
>     I do think the code would be easier to reason about if future cancellation and maintaining
the additional map were not necessary.  I don't feel strongly about this, however.

I took this suggestion and removed the map.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 294 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line294>
> >
> >     `==` instead of `.equals()`

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 297 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line297>
> >
> >     This is unnecessary.

Removed.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 390 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line390>
> >
> >     s/The goal here is to c/C/

Done.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java
> > Lines 400 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885209#file1885209line400>
> >
> >     It's worth mentioning that this is done to bound the number of events stored
for a task.

Added.


> On Nov. 9, 2017, 5:23 p.m., Bill Farner wrote:
> > src/main/java/org/apache/aurora/scheduler/state/StateModule.java
> > Lines 70 (patched)
> > <https://reviews.apache.org/r/63536/diff/2/?file=1885210#file1885210line70>
> >
> >     PubsubEventModule.bindSubscriber(binder, PartitionManager.class);

Done.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190607
-----------------------------------------------------------


On Nov. 14, 2017, 10:41 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 14, 2017, 10:41 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal
document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging,
etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift 1d369263d779b549b9304018437f535ddc855966

>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449

>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0

>   src/main/java/org/apache/aurora/scheduler/base/Jobs.java 8d5f4e57c6b4f847cb74471f246fd0b7dd0cbc36

>   src/main/java/org/apache/aurora/scheduler/base/TaskTestUtil.java 7c223eae69503fe1d5bf34c430438637abcbcb9b

>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION

>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a

>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9

>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java c03fff11ea3a4086f9daaa8b07315006c1b481e4

>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1

>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc

>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e

>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c

>   src/test/java/org/apache/aurora/scheduler/base/JobsTest.java 13f656f241a8a9a3d339f4053f165070c2669ef3

>   src/test/java/org/apache/aurora/scheduler/config/CommandLineTest.java c2d875bb5c393dd95d75251fe86dc649ceba7bd9

>   src/test/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModuleTest.java
7b0429109e9a7795e559db264e7737fc55ff0169 
>   src/test/java/org/apache/aurora/scheduler/state/PartitionManagerTest.java PRE-CREATION

>   src/test/java/org/apache/aurora/scheduler/state/StateManagerImplTest.java 0366cd6e9ddba0c3b9c88ffb50738767a352a17a

>   src/test/java/org/apache/aurora/scheduler/state/TaskStateMachineTest.java 8d6c3fff0af2df39bb929f760b862a2edf5d6fca

>   src/test/python/apache/aurora/client/cli/test_task.py 186cb2737ba8e169819b7d54f86a7344a669b6cb

> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/4/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly
to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before
moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving
to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message