mesos-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Greg Mann <g...@mesosphere.io>
Subject Request for Comments - Health Check API Proposal
Date Wed, 17 Oct 2018 18:25:45 GMT
Hi all,
Some users have recently reported issues with our current implementation of
health checks. See this ticket
<https://issues.apache.org/jira/browse/MESOS-6417> for an introduction to
the issue.

To summarize: we currently use a single 'optional bool healthy' field
within the 'TaskStatus' message to indicate the result of a health check.
This allows us to expose 3 health states to users:
1) 'healthy' field is unset = no health check specified, or health check
failed but grace period has not yet elapsed, or health check has not yet
been attempted
2) 'healthy' field is set to 'false' = a health check is specified and it
returned 'false'
3) 'healthy' field is set to 'true' = a health check is specified and it
returned 'true'

The issue is that some users need to distinguish between the three
scenarios in #1: no health check is specified, OR the task is not yet
healthy but we are in the grace period. An example use case would be a load
balancer which needs to wait for a healthy status to route traffic, but
which immediately routes traffic to tasks which have no health check
defined.

This issue was recognized during the design of Mesos generalized checks;
for those checks, we use the presence of the 'check_status' field to
indicate whether or not a check is defined for the task. While consumers
could make use of generalized checks as a workaround, this does not allow
them to both detect the presence of a check AND achieve the task-killing
behavior that health checks provide.

In order to address this, I would like to propose the following new
message, and an addition to the 'TaskStatus' message:

message HealthCheckStatusInfo {
  enum Status {
    UNKNOWN = 0;
    HEALTHY = 1;
    UNHEALTHY = 2;
  }

  required Status status = 0;
}

message TaskStatus {
  . . .

  optional HealthCheckStatusInfo health_check_status = 17;

  . . .
}

The semantics of these fields would be as follows:

'health_status' field:
- If set, a health check has been set
- If unset, a health check has not been set

'health_status.status' field:
- UNKNOWN: The task has not become healthy but is still within its grace
period (this state is also used if an internal error prevents us from
running the health check successfully)
- HEALTHY: The health check indicates the task is healthy
- UNHEALTHY: The health check indicates the task is not healthy

This change would also involve deprecating the existing 'healthy' field. In
accordance with our deprecation policy, I believe we could not remove the
deprecated field until we have a new major release (2.x).

I'd love to hear feedback on this proposal, thanks in advance! I'll also
add this as an agenda item to our upcoming API working group meeting on
Tuesday, Oct. 16 at 11am PST.

Cheers,
Greg

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message