mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Rukletsov <a...@mesosphere.com>
Subject Re: Request for Comments - Health Check API Proposal
Date Thu, 18 Oct 2018 20:40:08 GMT
Why do we need to resolve this note now? It is obvious that health
interpretation must be part of the API. I'm not sure I understand what
concerns you have, Vinod.

On Thu, Oct 18, 2018 at 8:20 PM Vinod Kone <vinodkone@apache.org> wrote:

> I understand and am in agreement that `HealthCheckStatusInfo` will have
> more information than `CheckStatusInfo`.
>
> I would like us to put a little more thought into how that would look like
> to be doubly sure that what we are introducing today will be evolvable into
> that envisioned future. We have to live with API changes for a long time,
> so I would like to see more rigor here (e.g., has the note on top of the
> `HealthCheckStatusInfo` in the doc
> <
> https://docs.google.com/document/d/1VLdaH7i7UDT3_38aOlzTOtH7lwH-laB8dCwNzte0DkU/edit#heading=h.lessdcojxc5v
> >
> has
> been discussed/resolved?) to avoid costly changes/deprecations.
>
> On Thu, Oct 18, 2018 at 4:04 AM Alex Rukletsov <alex@mesosphere.com>
> wrote:
>
> > Thanks for the thoughts, Vinod! Answers inlined.
> >
> > On Wed, Oct 17, 2018 at 8:55 PM Vinod Kone <vinodkone@apache.org> wrote:
> >
> > > One of the things we discussed when we added `CheckInfo` and
> > > `CheckStatusInfo` was to make the older `HealthCheck` and `bool
> healthy`
> > > field (inside `TaskStatus`) consistent with the new `Check` format.
> > >
> > Correct.
> >
> > >
> > > IIRC, some of the changes we wanted to do were
> > >
> > >    - Deprecate `HealthCheck` and introduce a new `HealthCheckInfo`
> proto
> > >
> > Correct.
> >
> > >    - The nested messages inside `HealthCheck` (e.g., `HTTPCheckInfo`)
> >
> >    should be named differently in `HealthCheckInfo` (e.g., `Http`)
> > >
> > Likely, yes.
> >
> > >    - Deprecate `bool healthy` in TaskStatusInfo and introduce a new
> > >    `HealthCheckStatusInfo` which looks similar to `CheckStatusInfo`
> > >
> > Correct.
> >
> > >
> > > Right now, the proposal seems to only address the last point without
> > > addressing the first two, which feels weird to me. I would prefer to
> see
> > > them addressed in one shot.
> > >
> > Can you please explain why? Is there any problem you foresee if we do it
> > step by step? Introducing `HealthCheckStatusInfo` now solves an important
> > problem and does not seem to introduce new issues.
> >
> > >
> > > Additionally, the proposed `HealthCheckStatusInfo` proto looks
> completely
> > > different from `CheckStatusInfo`. Is that intentional? I hope we are
> not
> > > thinking of deprecating it again when we come around to fix
> `HealthCheck`
> > > proto to be consistent with `CheckInfo` ?
> > >
> > How do you think it should look like? Why will we deprecate it?
> >
> > Health checks are different from checks in the way the result of a check
> is
> > interpreted on the agent. In other words health check is an extra step on
> > top of a check. We might include `CheckStatusInfo` or its contents into
> > `HealthCheckStatusInfo`, but... should we think about this now? It is
> nice
> > to have lower level info from the check in the heath status update, but
> it
> > also means more data to transfer. But interpretation—health—we definitely
> > need.
> >
> > Greg, I'm +1 on your proposal.
> >
> > >
> > > Thanks,
> > >
> > > On Wed, Oct 17, 2018 at 1:26 PM Greg Mann <greg@mesosphere.io> wrote:
> > >
> > > > Hi all,
> > > > Some users have recently reported issues with our current
> > implementation
> > > > of health checks. See this ticket
> > > > <https://issues.apache.org/jira/browse/MESOS-6417> for an
> introduction
> > > to
> > > > the issue.
> > > >
> > > > To summarize: we currently use a single 'optional bool healthy' field
> > > > within the 'TaskStatus' message to indicate the result of a health
> > check.
> > > > This allows us to expose 3 health states to users:
> > > > 1) 'healthy' field is unset = no health check specified, or health
> > check
> > > > failed but grace period has not yet elapsed, or health check has not
> > yet
> > > > been attempted
> > > > 2) 'healthy' field is set to 'false' = a health check is specified
> and
> > it
> > > > returned 'false'
> > > > 3) 'healthy' field is set to 'true' = a health check is specified and
> > it
> > > > returned 'true'
> > > >
> > > > The issue is that some users need to distinguish between the three
> > > > scenarios in #1: no health check is specified, OR the task is not yet
> > > > healthy but we are in the grace period. An example use case would be
> a
> > > load
> > > > balancer which needs to wait for a healthy status to route traffic,
> but
> > > > which immediately routes traffic to tasks which have no health check
> > > > defined.
> > > >
> > > > This issue was recognized during the design of Mesos generalized
> > checks;
> > > > for those checks, we use the presence of the 'check_status' field to
> > > > indicate whether or not a check is defined for the task. While
> > consumers
> > > > could make use of generalized checks as a workaround, this does not
> > allow
> > > > them to both detect the presence of a check AND achieve the
> > task-killing
> > > > behavior that health checks provide.
> > > >
> > > > In order to address this, I would like to propose the following new
> > > > message, and an addition to the 'TaskStatus' message:
> > > >
> > > > message HealthCheckStatusInfo {
> > > >   enum Status {
> > > >     UNKNOWN = 0;
> > > >     HEALTHY = 1;
> > > >     UNHEALTHY = 2;
> > > >   }
> > > >
> > > >   required Status status = 0;
> > > > }
> > > >
> > > > message TaskStatus {
> > > >   . . .
> > > >
> > > >   optional HealthCheckStatusInfo health_check_status = 17;
> > > >
> > > >   . . .
> > > > }
> > > >
> > > > The semantics of these fields would be as follows:
> > > >
> > > > 'health_status' field:
> > > > - If set, a health check has been set
> > > > - If unset, a health check has not been set
> > > >
> > > > 'health_status.status' field:
> > > > - UNKNOWN: The task has not become healthy but is still within its
> > grace
> > > > period (this state is also used if an internal error prevents us from
> > > > running the health check successfully)
> > > > - HEALTHY: The health check indicates the task is healthy
> > > > - UNHEALTHY: The health check indicates the task is not healthy
> > > >
> > > > This change would also involve deprecating the existing 'healthy'
> > field.
> > > > In accordance with our deprecation policy, I believe we could not
> > remove
> > > > the deprecated field until we have a new major release (2.x).
> > > >
> > > > I'd love to hear feedback on this proposal, thanks in advance! I'll
> > also
> > > > add this as an agenda item to our upcoming API working group meeting
> on
> > > > Tuesday, Oct. 16 at 11am PST.
> > > >
> > > > Cheers,
> > > > Greg
> > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message