aurora-dev mailing list archives

From Maxim Khutornenko <>
Subject Re: aurora watch_secs change
Date Thu, 18 Dec 2014 02:20:52 GMT
Resending as my original post got dropped somehow.

Here is a follow-up to our in-person discussion. Participants: Moses, wickman,
kevints, maxim.

The proposal we came up with does not require implementing scheduler
health checks (AURORA-279). The idea is to require the executor to
move a task from STARTING to RUNNING only when its health checks are
satisfied. This will make the updater go faster by relying directly on
the RUNNING status update, which will now be a true reflection of a
healthy user task. watch_secs will still be useful for updating tasks
that do not have health checks enabled.
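
To illustrate the updater side of this idea, here is a rough sketch of
the per-instance decision. The function name and its parameters are
hypothetical and are not taken from the Aurora code base; it only
assumes that RUNNING already implies "healthy" when health checks are
enabled.

```python
def instance_update_succeeded(status, health_checks_enabled,
                              secs_in_running, watch_secs):
    """Hypothetical updater check for one instance (not Aurora's actual API).

    status: latest task status as a string, e.g. "STARTING" or "RUNNING".
    secs_in_running: seconds the task has spent in RUNNING so far.
    """
    if status != "RUNNING":
        # Still STARTING (or in some other state): nothing to conclude yet.
        return False
    if health_checks_enabled:
        # Assumption: the executor reports RUNNING only after the health
        # checks are satisfied, so there is no need to wait out watch_secs.
        return True
    # No health checks: fall back to observing the task for watch_secs.
    return secs_in_running >= watch_secs
```

With health checks enabled the updater can move on as soon as RUNNING
arrives; without them it behaves as it does today.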

Below is a high-level summary of the required changes (incomplete).

- Modify task state machine to treat STARTING as a new active
(non-transient) state
- Modify Preemptor to account for STARTING
- Modify stats and SLA metrics to properly account for STARTING
- Modify scheduler updater to short-circuit watch_secs when health
checks are enabled

- Add a max_consecutive_successes setting to HealthCheckConfig [1] to
tell the executor when to move a task into RUNNING.

- Modify state transition logic to rely on health checks (if enabled)
to move the task into RUNNING. Transition from STARTING to RUNNING
immediately if task health checks are disabled.
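
The executor-side transition described in the last two items could look
roughly like the sketch below. StartupGate and TaskState are
hypothetical names used only for illustration, and resetting the
counter on a failed check is an assumption about how "consecutive"
would be interpreted.

```python
from enum import Enum


class TaskState(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"


class StartupGate:
    """Hypothetical helper that decides when a task may leave STARTING."""

    def __init__(self, health_checks_enabled, max_consecutive_successes=1):
        self.max_consecutive_successes = max_consecutive_successes
        self.consecutive_successes = 0
        # With health checks disabled the task goes to RUNNING immediately.
        if health_checks_enabled:
            self.state = TaskState.STARTING
        else:
            self.state = TaskState.RUNNING

    def record_health_check(self, healthy):
        """Feed one health check result and return the resulting state."""
        if self.state is TaskState.RUNNING:
            return self.state
        if healthy:
            self.consecutive_successes += 1
        else:
            # Assumption: a failed check resets the streak.
            self.consecutive_successes = 0
        if self.consecutive_successes >= self.max_consecutive_successes:
            self.state = TaskState.RUNNING
        return self.state
```

For example, with max_consecutive_successes=3 a task that sees healthy,
unhealthy, healthy, healthy, healthy stays in STARTING until the last
check and only then moves to RUNNING.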

Open question: with STARTING becoming a non-transient state from the
scheduler standpoint, there is nothing to enforce its exit. This may
be OK as STARTING will effectively be a stable user-defined state.
However, this is something we may want to cap to avoid adverse user



[1] -

On Sat, Dec 13, 2014 at 11:06 AM, Nakamura <> wrote:
> Hey,
> Just wanted to make sure my email didn't get lost in the cracks.
> As a reminder, the previous emails in this thread were:
> Bill Farner
> <>
> Brian Wickman
> <>
> Best,
> Moses
> On Thu Dec 04 2014 at 11:14:02 AM Nakamura <> wrote:
>> Hey,
>> Sorry that this is replying to my own email, I didn't realize that I had
>> to subscribe to the dev@aurora listserv to get updates.  This email
>> should really be in response to Brian Wickman's response.
>> Hmm, I don't think only sending the transitions is sufficient though.  My
>> concern is that since sending framework messages isn't reliable, we could
>> end up in a situation where the scheduler perceives the task is healthy
>> even though it's not.
>> 1. scheduler spins up executor
>> 2. executor unhealthy
>> 3. executor transitions to healthy, sends message to scheduler
>> 4. scheduler receives healthy message
>> 5. executor transitions to unhealthy before N healthy messages, sends
>> message to scheduler
>> 6. scheduler does not receive unhealthy message
>> 7. after waiting for N messages * time between messages without a
>> response, it assumes that it has remained healthy and marks it as healthy
>> enough to continue.
>> We can fix this by changing 7 to include the check that's currently
>> included in the watch_secs delayed action.
>> Here is my new proposal for how B should work:
>> Executor sends health transitions as framework messages to the
>> scheduler.  When the scheduler receives a transition to healthiness, it
>> waits for N messages * time between messages, and then sends a request to
>> ask if the executor is still healthy.  If the scheduler never sees a
>> healthy message, it defaults to the old behavior, sending a request at
>> watch_secs. Once the scheduler no longer needs the transitions, it tells
>> the executor to stop sending the messages.
>> Thoughts?  Are there any easy ways I can simplify the design?
>> Best,
>> Moses
>> On Tue Dec 02 2014 at 1:53:24 PM Nakamura <> wrote:
>>> Howdy,
>>> I'm interested in tackling AURORA-894, but I'm not terribly familiar with
>>> aurora, so I'd like some feedback on my design before I go forth.
>>> Bill pointed out that the hard bit would be designing the algorithm so it
>>> doesn't DDoS the scheduler, and I think I have an idea of the possible
>>> design space.  I wanted to know what you thought.
>>> A.  sample the number of health checks, and send them back to the
>>> scheduler.  this is pretty simple, but 99% of the time will be total noise,
>>> since the data isn't generally useful.
>>> B.  the executor sends health checks until it receives an out of band
>>> request from the scheduler not to.  this seems fragile (I'm imagining
>>> mismatched executors/schedulers behaving poorly) but would also probably be
>>> reasonably simple.
>>> C.  a slightly more sophisticated approach might be to tell the executor
>>> how many health checks to look for, so that it could send a status update
>>> back, since status updates have reliable delivery.
>>> D. when the scheduler has finished standing up the executor, it
>>> long-polls, which also takes care of reliable delivery because it's
>>> presumably over TCP and we have total control (not having to go through
>>> mesos).
>>> I'm hesitant to do A, because it's so wasteful.  B sounds fragile, so I
>>> don't want to do that one.  D requires long-polling, which your client may
>>> or may not do well.  I'm leaning toward C.  Do you think that sounds like a
>>> reasonable approach?
>>> Thanks,
>>> Moses
