aurora-dev mailing list archives

From Nakamura <>
Subject Re: aurora watch_secs change
Date Sat, 13 Dec 2014 19:06:28 GMT
Just wanted to make sure my email didn't fall through the cracks.

As a reminder, the previous emails in this thread were:
Bill Farner
Brian Wickman


On Thu Dec 04 2014 at 11:14:02 AM Nakamura <> wrote:

> Hey,
> Sorry that this is a reply to my own email; I didn't realize that I had
> to subscribe to the dev@aurora listserv to get updates.  This email
> should really be in response to Brian Wickman's reply.
> Hmm, I don't think only sending the transitions is sufficient though.  My
> concern is that since sending framework messages isn't reliable, we could
> end up in a situation where the scheduler perceives the task is healthy
> even though it's not.
> 1. scheduler spins up executor
> 2. executor unhealthy
> 3. executor transitions to healthy, sends message to scheduler
> 4. scheduler receives healthy message
> 5. executor transitions to unhealthy before N healthy messages, sends
> message to scheduler
> 6. scheduler does not receive unhealthy message
> 7. after waiting for N messages * time between messages without hearing
> otherwise, the scheduler assumes that the executor has remained healthy and
> marks it as healthy enough to continue.
> We can fix this by changing 7 to include the check that's currently
> included in the watch_secs delayed action.
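The seven steps above can be replayed as a small simulation. This is only a sketch: `Scheduler`, `deliver`, and `final_decision` are hypothetical names, not Aurora code, and the `confirm` flag stands in for the check currently done in the watch_secs delayed action.

```python
from dataclasses import dataclass


@dataclass
class Scheduler:
    """Hypothetical scheduler-side view of one task's health."""
    believed_healthy: bool = False

    def on_transition(self, healthy: bool) -> None:
        # The scheduler only knows what the last delivered message said.
        self.believed_healthy = healthy


def deliver(scheduler, transitions, dropped):
    """Deliver each transition unless its index is in `dropped`.

    `dropped` models the unreliable framework-message channel.
    """
    for i, healthy in enumerate(transitions):
        if i not in dropped:
            scheduler.on_transition(healthy)


def final_decision(scheduler, actual_healthy, confirm):
    """Step 7: decide whether the task is healthy enough to continue.

    confirm=False: trust the last message seen.
    confirm=True: also ask the executor directly (assumed reliable here).
    """
    if not scheduler.believed_healthy:
        return False
    return actual_healthy if confirm else True


# Steps 3-6: executor turns healthy, then unhealthy; the second message is lost.
transitions = [True, False]
s = Scheduler()
deliver(s, transitions, dropped={1})
actual = transitions[-1]  # the executor really is unhealthy

print(final_decision(s, actual, confirm=False))  # True: the wrong answer
print(final_decision(s, actual, confirm=True))   # False: the lost message no longer matters
```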
> Here is my new proposal for how B should work:
> Executor sends health transitions as framework messages to the
> scheduler.  When the scheduler receives a transition to healthiness, it
> waits for N messages * time between messages, and then sends a request to
> ask if the executor is still healthy.  If the scheduler never sees a
> healthy message, it defaults to the old behavior, sending a request at
> watch_secs. Once the scheduler no longer needs the transitions, it tells
> the executor to stop sending the messages.
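As a rough sketch of the timing in this proposal, the function below computes when the scheduler would send its "are you still healthy?" request. The name `confirmation_time` and its parameters are made up for illustration. It also assumes that an unhealthy transition received after a healthy one cancels the pending wait, which the proposal does not spell out. The "stop sending" message is not modeled.

```python
from typing import List, Optional, Tuple

Transition = Tuple[float, bool]  # (seconds since task start, healthy?)


def confirmation_time(
    received: List[Transition],
    n_messages: int,
    interval_secs: float,
    watch_secs: float,
) -> float:
    """Return when (seconds since task start) to ask the executor for its health."""
    healthy_since: Optional[float] = None
    for at, healthy in received:
        # Assumption: an unhealthy transition cancels the pending wait.
        healthy_since = at if healthy else None
    if healthy_since is None:
        # No usable healthy message: fall back to the old watch_secs check.
        return watch_secs
    return healthy_since + n_messages * interval_secs


print(confirmation_time([], 3, 10.0, 60.0))                             # 60.0
print(confirmation_time([(5.0, True)], 3, 10.0, 60.0))                  # 35.0
print(confirmation_time([(5.0, True), (12.0, False)], 3, 10.0, 60.0))   # 60.0
```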
> Thoughts?  Are there any easy ways I can simplify the design?
> Best,
> Moses
> On Tue Dec 02 2014 at 1:53:24 PM Nakamura <> wrote:
>> Howdy,
>> I'm interested in tackling AURORA-894, but I'm not terribly familiar with
>> Aurora, so I'd like some feedback on my design before I go forth.
>> Bill pointed out that the hard bit would be designing the algorithm so it
>> doesn't DDoS the scheduler, and I think I have an idea of the possible
>> design space.  I wanted to know what you thought.
>> A.  sample the health checks, and send the samples back to the
>> scheduler.  this is pretty simple, but 99% of the time it will be total
>> noise, since the data isn't generally useful.
>> B.  the executor sends health checks until it receives an out of band
>> request from the scheduler not to.  this seems fragile (I'm imagining
>> mismatched executors/schedulers behaving poorly) but would also probably be
>> reasonably simple.
>> C.  a slightly more sophisticated approach might be to tell the executor
>> how many health checks to look for, so that it could send a status update
>> back, since status updates have reliable delivery.
>> D.  when the scheduler has finished standing up the executor, it
>> long-polls, which also takes care of reliable delivery because it's
>> presumably over TCP and we have total control (not having to go through
>> Mesos).
>> I'm hesitant to do A, because it's so wasteful.  B sounds fragile, so I
>> don't want to do that one.  D requires long-polling, which your client may
>> or may not do well.  I'm leaning toward C.  Do you think that sounds like a
>> reasonable approach?
>> Thanks,
>> Moses
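
Option C from the original message could look roughly like this on the executor side. It is a minimal sketch under stated assumptions: `run_health_checks` is a hypothetical helper, the required checks are assumed to be consecutive, and `"HEALTHY"` is a placeholder string rather than a real Mesos task state. `send_status_update` stands in for whatever reliable status-update call the executor would use.

```python
from typing import Callable, Iterable


def run_health_checks(
    results: Iterable[bool],
    required_healthy: int,
    send_status_update: Callable[[str], None],
) -> bool:
    """Send one status update once `required_healthy` consecutive checks pass.

    Returns True if the update was sent, False if the results ran out first.
    """
    streak = 0
    for healthy in results:
        # A failed check resets the streak (assumption: checks must be consecutive).
        streak = streak + 1 if healthy else 0
        if streak >= required_healthy:
            send_status_update("HEALTHY")  # placeholder payload
            return True
    return False


sent = []
reached = run_health_checks([True, False, True, True, True], 3, sent.append)
print(reached, sent)  # True ['HEALTHY']
```

The scheduler only ever sees one message per task, which is what keeps this option from flooding it.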
