aurora-dev mailing list archives

From Nakamura <>
Subject Re: aurora watch_secs change
Date Thu, 04 Dec 2014 19:14:03 GMT

Sorry for replying to my own email; I didn't realize that I had to
subscribe to the dev@aurora listserv to get updates.  This email is
really a response to Brian Wickman's reply.

Hmm, I don't think sending only the transitions is sufficient, though.  My
concern is that since framework messages aren't delivered reliably, we could
end up in a situation where the scheduler perceives the task as healthy
even though it's not:

1. scheduler spins up executor
2. executor unhealthy
3. executor transitions to healthy, sends message to scheduler
4. scheduler receives healthy message
5. executor transitions to unhealthy before N healthy messages, sends
message to scheduler
6. scheduler does not receive unhealthy message
7. after waiting for N messages * time between messages without a response,
the scheduler assumes the executor has remained healthy and marks it as healthy enough to

We can fix this by changing step 7 to include the check that's currently
included in the watch_secs delayed action.
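
A minimal sketch of the failure sequence above and of the proposed fix,
assuming health transitions are plain strings and that a lost framework
message simply never reaches the scheduler.  All of the names here
(last_state_seen, naive_decision, checked_decision, ask_executor) are
hypothetical and are not Aurora or Mesos APIs:

```python
def last_state_seen(transitions, delivered):
    """Return the most recent transition that actually reached the scheduler."""
    seen = None
    for state, arrived in zip(transitions, delivered):
        if arrived:
            seen = state
    return seen


def naive_decision(transitions, delivered):
    """Step 7 as originally written: trust the last message received."""
    return last_state_seen(transitions, delivered) == "healthy"


def checked_decision(transitions, delivered, ask_executor):
    """Step 7 with the fix: also ask the executor directly before deciding."""
    return naive_decision(transitions, delivered) and ask_executor()


# Steps 3-6: the healthy transition arrives, the unhealthy one is lost.
transitions = ["healthy", "unhealthy"]
delivered = [True, False]
actually_healthy = transitions[-1] == "healthy"  # False

print(naive_decision(transitions, delivered))
print(checked_decision(transitions, delivered, lambda: actually_healthy))
```

The naive decision reports the task as healthy because the scheduler never
saw the second transition; the checked decision does not, because the
direct query reflects the executor's real state.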

Here is my new proposal for how B should work:

The executor sends health transitions as framework messages to the
scheduler.  When the scheduler receives a transition to healthy, it waits
for N messages * time between messages, and then sends a request to ask if
the executor is still healthy.  If the scheduler never sees a healthy
message, it defaults to the old behavior, sending a request at watch_secs.
Once the scheduler no longer needs the transitions, it tells the executor
to stop sending the messages.
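
Roughly, the scheduler side of this proposal could look like the sketch
below.  It is only an illustration under assumptions: the class
TransitionWatcher, its method names, and the ask_executor callback are
hypothetical, times are plain seconds, and the sketch does not decide
what should happen when an unhealthy transition does arrive, since the
proposal leaves that open:

```python
class TransitionWatcher:
    """Hypothetical scheduler-side bookkeeping for one task."""

    def __init__(self, started_at, n_messages, interval_secs, watch_secs):
        self.started_at = started_at
        self.n_messages = n_messages        # N in the proposal
        self.interval_secs = interval_secs  # time between messages
        self.watch_secs = watch_secs
        self.healthy_seen_at = None         # first healthy transition, if any
        self.wants_transitions = True       # False once the executor may stop

    def on_transition(self, now, healthy):
        """Record the first transition to healthy that is received."""
        if healthy and self.healthy_seen_at is None:
            self.healthy_seen_at = now

    def check_due_at(self):
        """Time at which the scheduler should query the executor."""
        if self.healthy_seen_at is not None:
            return self.healthy_seen_at + self.n_messages * self.interval_secs
        # No healthy message was ever seen: fall back to the old behavior.
        return self.started_at + self.watch_secs

    def poll(self, now, ask_executor):
        """Return None while waiting, else the executor's own answer."""
        if now < self.check_due_at():
            return None
        self.wants_transitions = False      # tell the executor to stop sending
        return ask_executor()


watcher = TransitionWatcher(started_at=0.0, n_messages=3,
                            interval_secs=10.0, watch_secs=120.0)
watcher.on_transition(now=5.0, healthy=True)
print(watcher.check_due_at())
```

With these example numbers the query moves from the watch_secs deadline
(120 seconds) up to 35 seconds, and the final answer always comes from the
direct query rather than from the unreliable transition messages.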

Thoughts?  Are there any easy ways I can simplify the design?


On Tue Dec 02 2014 at 1:53:24 PM Nakamura <> wrote:

> Howdy,
> I'm interested in tackling AURORA-894, but I'm not terribly familiar with
> aurora, so I'd like some feedback on my design before I go forth.
> Bill pointed out that the hard bit would be designing the algorithm so it
> doesn't DDoS the scheduler, and I think I have an idea of the possible
> design space.  I wanted to know what you thought.
> A.  sample the number of health checks, and send them back to the
> scheduler.  this is pretty simple, but 99% of the time will be total noise,
> since the data isn't generally useful.
> B.  the executor sends health checks until it receives an out of band
> request from the scheduler not to.  this seems fragile (I'm imagining
> mismatched executors/schedulers behaving poorly) but would also probably be
> reasonably simple.
> C.  a slightly more sophisticated approach might be to tell the executor
> how many health checks to look for, so that it could send a status update
> back, since status updates have reliable delivery.
> D. when the scheduler has finished standing up the executor, it
> long-polls, which also takes care of reliable delivery because it's
> presumably over TCP and we have total control (not having to go through
> mesos).
> I'm hesitant to do A, because it's so wasteful.  B sounds fragile, so I
> don't want to do that one.  D requires long-polling, which your client may
> or may not do well.  I'm leaning toward C.  Do you think that sounds like a
> reasonable approach?
> Thanks,
> Moses
