aurora-dev mailing list archives

From Maxim Khutornenko <ma...@apache.org>
Subject Re: Proposal: External Update Coordination
Date Mon, 13 Oct 2014 23:55:05 GMT
The main reason I preferred the lack-of-ACK approach over an explicit
NACK one is simplicity. As Joshua pointed out, there is more state to
handle with a NACK. The lack-of-ACK model can be implemented
completely in volatile memory, sidestepping the persistent storage
entirely. With the NACK we would need to reliably persist external
service call reasons to survive scheduler failovers. Not a huge
challenge but something to keep in mind.
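
To illustrate the point about volatile memory, here is a minimal sketch
of the scheduler side of the lack-of-ACK model. The HeartbeatTracker
class, its method names and the timeout value are hypothetical and are
not taken from the Aurora code base. The only state is a timestamp per
update held in a dict, so after a failover the map is simply empty and
every update counts as overdue until its next heartbeat arrives.

```python
import time


class HeartbeatTracker:
    """Hypothetical in-memory heartbeat bookkeeping (illustrative only)."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self._timeout = timeout_secs
        self._clock = clock      # injectable so the logic can be tested
        self._last_seen = {}     # update id -> time of last heartbeat

    def heartbeat(self, update_id):
        """Record an ACK from the external service."""
        self._last_seen[update_id] = self._clock()

    def is_overdue(self, update_id):
        """True if no heartbeat was ever seen or the last one is too old."""
        last = self._last_seen.get(update_id)
        if last is None:
            return True
        return self._clock() - last > self._timeout
```

What the scheduler does with an overdue update (pause it, and whether a
later heartbeat alone may resume it) is deliberately left out, since
that is the open question in this thread.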

I still think the simplicity/reliability tradeoff is acceptable here
if we rely on the external service to stop sending heartbeats when a
health alert fires. This can be explicitly documented as an external
integration requirement. However, if the consensus is to go the more
reliable (though more complicated) NACK route, I am happy to reconsider
the current proposal.
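
As a rough sketch of that integration requirement, the external service
could run a loop like the one below. The function name and its
parameters (send_heartbeat, is_healthy, interval_secs) are assumptions
made for illustration; in practice send_heartbeat would wrap the
heartbeatJobUpdate RPC. The loop stops for good on the first failed
health check, so a service that toggles back to healthy does not
silently restart the heartbeats.

```python
import time


def run_heartbeats(send_heartbeat, is_healthy, interval_secs,
                   sleep=time.sleep, max_beats=None):
    """Send heartbeats while healthy; return how many were sent.

    Hypothetical helper: all names here are illustrative.
    """
    sent = 0
    while max_beats is None or sent < max_beats:
        if not is_healthy():
            # Abort: from now on silence acts as the NACK.
            return sent
        send_heartbeat()
        sent += 1
        sleep(interval_secs)
    return sent
```

Restarting the loop after an abort would then be an explicit operator
decision rather than something that happens automatically.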

On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <jcohen@twopensource.com> wrote:
> "The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.  If
> we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
> to resume when we receive another ACK.  In other words, a service toggling
> unhealthy might not be deemed safe to proceed."
>
> Lack-of-ACK is the scenario where connectivity between the monitor and the
> scheduler is unavailable. Shouldn't the NACK scenario (everything is not
> ok!) be handled by the monitoring service triggering an explicit pause?
> I.e. section 2 should be updated to say "External service detects service
> health problems and pauses the update" and section 4 becomes the current
> section 2 (i.e. "Should a heartbeat not be received the scheduler pauses
> the update.").
>
> I agree that it's unsafe to resume updates after receiving a heartbeat
> after previously pausing due to a missed heartbeat. In that scenario I'd
> think we'd want an explicit resumeJobUpdate. If the scenario we're trying
> to handle is *never* received a heartbeat, that's a separate matter, in
> that case unpausing upon receiving the first heartbeat would make sense,
> but it feels like that complicates things quite a bit (now we need to
> differentiate between heartbeat #1 and heartbeat #N).
>
> On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfarner@apache.org> wrote:
>
>> What is the guidance for deploying while the heartbeat service is broken?
>> I think I know the answer, but it's important to spell out.
>>
>>
>>
>> > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
>> > state to avoid any progress until the first heartbeat call arrives.
>>
>>
>> I'm not sold on this being ultimately beneficial.  In the worst case,
>> impact is still limited by the health check threshold.  Seems like
>> premature optimization at best, and an odd one if we proceed without a
>> 'NACK' signal via the heartbeatJobUpdate RPC.
>>
>> > Allow resuming of the paused-due-to-no-heartbeat update via a
>> > resumeJobUpdate call.
>>
>>
>> Are heartbeats required while rolling back?  If so, that might impact the
>> design here and in other places.
>>
>> > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
>> > heartbeatJobUpdate call.
>>
>>
>> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.  If
>> we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
>> to resume when we receive another ACK.  In other words, a service toggling
>> unhealthy might not be deemed safe to proceed.
>>
>> > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused job
>> > update would make more sense, as there is nothing the monitoring service
>> > could do in that case. This should work fine with both pause/resume-aware
>> > and pause/resume-agnostic monitoring service implementations.
>>
>>
>> This seems reasonable to me - heartbeats for a paused update should not
>> pose a risk, but can be safely ignored.
>>
>>
>>
>> -=Bill
>>
>> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <maxim@apache.org>
>> wrote:
>>
>> > Agreed. That would be a logical generalization of the post failover
>> > behavior.
>> >
>> > I have updated the above document with the following changes:
>> > - Reply with PAUSED any time a job was paused by user;
>> > - Start in paused state by default.
>> >
>> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevints@apache.org>
>> > wrote:
>> > > The doc mentioned that the scheduler will start an update subject to the
>> > > heartbeat countdown, and if it doesn't receive a heartbeat it will pause
>> > > the update. Why not start with the update paused-due-to-no-heartbeat to
>> > > fail-fast any connectivity issues between the service providing the
>> > > heartbeats and the scheduler?
>> > >
>> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <maxim@apache.org>
>> > > wrote:
>> > >
>> > >> Hi all,
>> > >>
>> > >> We are proposing a new feature for the scheduler updater, which you
>> > >> may find helpful.
>> > >>
>> > >> I have posted a brief feature summary here:
>> > >>
>> > >>
>> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
>> > >>
>> > >> Please reply with your feedback/concerns/comments.
>> > >>
>> > >> Thanks,
>> > >> Maxim
>> > >>
>> >
>>
