aurora-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David McLaughlin <da...@dmclaughlin.com>
Subject Re: Review Request 63536: Give jobs the ability to determine how to handle partitions by integrating with new Mesos Partition-Aware APIs
Date Tue, 07 Nov 2017 23:44:01 GMT


> On Nov. 7, 2017, 11:36 p.m., Stephan Erb wrote:
> > src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java
> > Lines 38-42 (patched)
> > <https://reviews.apache.org/r/63536/diff/1/?file=1884100#file1884100line38>
> >
> >     On the other hand, this would be consistent from the user perspective. A user
only sees the compacted task events with a single RUNNING to PARTITIONED transition. 
> >     
> >     If we cancel the future, he might wonder why the task does not end up as LOST.

Maybe it's not clear without tests - but this timer is responsible for moving from PARTITIONED
-> LOST after a specified delay. This comment is concerned about the user saying "give
up on a partition after 15 minutes" and then a task moving from RUNNING -> PARTITIONED
-> RUNNING -> PARTITIONED within a 15 minute window. When the task transitions back
to RUNNING the first time, the user would expect the timer to be reset.


- David


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63536/#review190383
-----------------------------------------------------------


On Nov. 7, 2017, 7:28 p.m., David McLaughlin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63536/
> -----------------------------------------------------------
> 
> (Updated Nov. 7, 2017, 7:28 p.m.)
> 
> 
> Review request for Aurora, Jordan Ly, Santhosh Kumar Shanmugham, and Bill Farner.
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This is my prototype code for adding partition-awareness to Aurora. There is a proposal
document to accompany this here: https://docs.google.com/document/d/1E3GlsVTJLEMAkDWk2_PTxzkRZcapb8nF_5q5AADQI7g/edit#
> 
> I'd like feedback on the high-level approach before adding unit tests, metrics, logging,
etc.
> 
> 
> Diffs
> -----
> 
>   api/src/main/thrift/org/apache/aurora/gen/api.thrift c869493c06499340d73e1b219e17a0d7d8b5ead9

>   examples/vagrant/upstart/aurora-scheduler.conf 5ca3caef03b6632cd4dbf47711b1ef183f6a6449

>   src/main/java/org/apache/aurora/scheduler/base/Conversions.java 33cc012a2cad929b0dd1ce236597b870cfc5aba0

>   src/main/java/org/apache/aurora/scheduler/mesos/CommandLineDriverSettingsModule.java
5e83b2acdde7198d16427d4031e9772f78612554 
>   src/main/java/org/apache/aurora/scheduler/state/PartitionManager.java PRE-CREATION

>   src/main/java/org/apache/aurora/scheduler/state/SideEffect.java b91a0852e968b4aa9d74801601671cb61af3648a

>   src/main/java/org/apache/aurora/scheduler/state/StateManagerImpl.java 9989ed441cd9bc442e6472768880ce7924c3bdd9

>   src/main/java/org/apache/aurora/scheduler/state/StateModule.java d72f055749801ee9d6f31f60857cc795d0ed7ab1

>   src/main/java/org/apache/aurora/scheduler/state/TaskStateMachine.java eb4fe7d78ad1e6ec430c428df527bd0cf3a053c1

>   src/main/python/apache/aurora/client/cli/jobs.py b79ae56bee0e5692cacf1e66f4a4126b06aaffdc

>   src/main/python/apache/aurora/config/schema/base.py 18ce826363009e1cc0beac5cce4edf42610d487e

>   src/main/python/apache/aurora/config/thrift.py bedf8cd6641e1b1a930602791b758d584af4891c

> 
> 
> Diff: https://reviews.apache.org/r/63536/diff/1/
> 
> 
> Testing
> -------
> 
> Manual testing in Vagrant by stopping and starting the Mesos agent. With three jobs:
> 
> 1) No PartitionPolicy (verified existing behavior of moving from PARTITIONED directly
to LOST)
> 2) PartitionPolicy with custom delay_secs (verified sat in PARTITIONED for a while before
moving to LOST)
> 3) PartitionPolicy with reschedule=False (verified sat in PARTITIONED indefinitely)
> 
> I also verified tasks are able to transition to various states (back to RUNNING or moving
to FAILED, etc.) when you turn the agent back on.
> 
> 
> File Attachments
> ----------------
> 
> Task in PARTITIONED state
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/02c7fc72-b11d-4ef9-a86b-914e748cad99__Screen_Shot_2017-11-07_at_11.23.41_AM.png
> Task back into running when partition resolved
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/a0413f54-1572-4410-a386-0a22e78fab13__Screen_Shot_2017-11-07_at_11.26.02_AM.png
> Compaction of PARTITIONED cycles (note timestamps)
>   https://reviews.apache.org/media/uploaded/files/2017/11/07/edec32e5-b3ec-4fdc-b93f-5449519805ae__Screen_Shot_2017-11-07_at_11.27.47_AM.png
> 
> 
> Thanks,
> 
> David McLaughlin
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message