aurora-dev mailing list archives

From Maxim Khutornenko <ma...@apache.org>
Subject Re: Heartbeat mechanism auditing
Date Thu, 29 Jan 2015 22:45:43 GMT
To add a bit of history to the topic, the current design has been
debated heavily here [1] and an active/lazy consensus was reached
around implementing the first iteration as lightweight as possible
without persisting any durable state.

My take on this - we should proceed as originally proposed given the following:

- History of heartbeats is the only feature that requires state
persistence. Nothing else in the current design benefits from
persisting the state across restarts. I consider pulse history a
nice-to-have rather than a requirement (unlike the current state
reporting, which is a must for troubleshooting and is tracked by
AURORA-1049).

- State persistence will come with additional complexity of handling
corner cases (restart, abort, resume, etc.) that is not well justified
at this point given our total lack of experience with heartbeats.

- Adding pulse history tracking can be done at later stages (as the
feature evolves and we gain more insight) without adverse user
impact or technical debt. On the contrary, if attempted early,
overlooked details may hurt down the road by requiring a Thrift schema
migration.

Thanks,
Maxim

[1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser

On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin
<dmclaughlin@apache.org> wrote:
> Hi all,
>
> There is a bit of a stalemate with regard to the implementation of
> the pulse RPC in the scheduler.
>
> As a brief overview of this feature - the pulse RPC is designed so that an
> external service can monitor the new in-scheduler updates reliably. This
> external service could be doing something like keeping an eye on
> application level alerts and pausing the update if things slip into a bad
> state. The purpose of the pulse is to make sure the update does not
> continue if it's not being monitored (i.e. the external service might have
> failed) by requiring positive acknowledgement at a given time interval.
>
> The implementation is in this review: https://reviews.apache.org/r/30225/
>
> The contention is around whether or not the "blocked" state deserves its
> own explicit state in the update state machine, and whether this is
> important enough to block the review. Currently any blocked updates are
> only known to the scheduler and the update will show as
> UPDATING/ROLLING_FORWARD in the UI and any history that the update was
> blocked will be lost - we only track current state.
>
> If you have any opinions on this feature, please feel free to chime in to
> the RB!
>
> Thanks,
> David
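
For readers new to the thread, the pulse idea described in the quoted
message (an update proceeds only while an external service keeps
acknowledging it within a time interval) can be illustrated with a
minimal in-memory sketch. This is not the code under review in the RB.
The class name PulseMonitor, its methods, the timeout parameter, and the
choice to treat an update as blocked until its first pulse arrives are
all assumptions made for illustration. State lives only in a dict, so it
is lost on restart, which mirrors the "no durable state" approach
discussed above.

```python
import time


class PulseMonitor:
    """Hypothetical in-memory pulse tracker (illustration only, no persistence)."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self._timeout_secs = timeout_secs
        self._clock = clock  # injectable so the timeout can be tested
        self._last_pulse = {}  # update_id -> time of last pulse, or None

    def register(self, update_id):
        # Assumption: a new update counts as blocked until its first pulse.
        self._last_pulse[update_id] = None

    def pulse(self, update_id):
        # Record a positive acknowledgement; unknown updates are rejected.
        if update_id not in self._last_pulse:
            return False
        self._last_pulse[update_id] = self._clock()
        return True

    def is_blocked(self, update_id):
        # Blocked when no pulse was ever recorded (including unknown
        # updates) or when the last pulse is older than the timeout.
        last = self._last_pulse.get(update_id)
        if last is None:
            return True
        return self._clock() - last > self._timeout_secs
```

Only the current blocked/unblocked answer is available in this sketch.
No history of past pulses or past blocked periods is kept, which is the
trade-off under discussion in this thread.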
