aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: RFC HealthCheck
Date Sat, 21 Feb 2015 18:35:24 GMT
If I'm reading the code correctly, the only way to use Mesos' health checks
is with the command executor?  Can somebody check my work on that?

Some other context around health checks to keep in mind:
- there is a review [1] in flight for the executor to delay the transition
to RUNNING until the first positive health check [2]
- we want to make the scheduler the authority for reacting to health check
failures [3].  This is a very real concern for large services, which need
to avoid simultaneous failures

[1] https://reviews.apache.org/r/31104/
[2] https://issues.apache.org/jira/browse/AURORA-894
[3] https://issues.apache.org/jira/browse/AURORA-279


-=Bill

On Sat, Feb 21, 2015 at 3:48 AM, Erb, Stephan <Stephan.Erb@blue-yonder.com>
wrote:

> Hi Florian,
>
> have you looked at what Mesos is already offering out of the box [1]?
> Maybe there is a way to implement your features by relying on Mesos
> directly, instead of making the Aurora implementation more flexible.
>
> As you've mentioned, the lifecycle endpoints abort and quit seem to be
> quite orthogonal to the health checking idea. I would be in favor of
> separating the different concepts. I even thought about this yesterday,
> because in our environment we only want health checking but now also have
> to pay a price of 10 seconds of additional latency when stopping jobs due
> to the graceful kill escalation.
>
> [1]
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L141
>
>
> Regards,
> Stephan
>
> ________________________________________
> From: Florian Pfeiffer <florian.pfeiffer@gutefrage.net>
> Sent: Saturday, February 21, 2015 4:27 AM
> To: dev@aurora.incubator.apache.org
> Subject: RFC HealthCheck
>
> Hi,
>
> I would like to start working on the health checker:
>
> 1) Enable configuration of the port name on which to run health checks
> (this should also tackle AURORA-321).
> This seems like a very small change: adding a new variable named "port"
> to the HealthCheckConfig in base.py, with a default value of "health" to
> stay backwards compatible. Any pitfalls? Any objections?
>
> 2) There's at least one ticket in JIRA about making the endpoints
> for the health check configurable. I would like to have a health check that
> works on HTTP status codes, and there might be other people who are fine
> with a health check that only verifies that a TCP connection can be
> established.
>
> For my use case I would probably be fine if I added a variable "method" to
> the HealthCheckConfig, with a default value of "classic" for the current
> behavior and something like "statuscode" for a check that is very similar
> to the current one in http_signaler.py but checks the status code instead
> of parsing the response (with the downside that the endpoints /health,
> /abort and /quitquitquit are still hardcoded).
>
> Any ideas how this can be made a bit more generic, so that if we have
> 3-5 different types of health checks, each can take its own arguments?
> (e.g. expected_response for the current one, expected_code for the status
> code checker, and maybe something like max_response_time for defining how
> fast traffic has to appear on a TCP connection check)
>
>
> A side question: to me it seems like /health and (/abort & /quitquitquit)
> are not very closely related. Does it make sense to have those 3 things
> grouped in the HealthCheck?
>
>
> Best,
> Florian
>
>
>
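One possible shape for the more generic configuration asked about in the
quoted RFC, as a rough sketch only. Nothing here is Aurora code: the names
HealthCheckSketch, Response, CHECKERS, classic_check and statuscode_check are
hypothetical, and the "ok" body used for the classic check is an assumption
about the current behavior. The idea is a "method" name that selects a checker
function, plus a free-form argument dict so each checker can take its own
parameters (expected_response, expected_code, ...).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Response:
    """Hypothetical stand-in for an HTTP response: status code plus body."""
    status: int
    body: str


def classic_check(resp: Response, expected_response: str = "ok") -> bool:
    # Assumed "classic" behavior: compare the response body to a fixed string.
    return resp.body.strip().lower() == expected_response.lower()


def statuscode_check(resp: Response, expected_code: int = 200) -> bool:
    # Proposed variant: ignore the body and look only at the status code.
    return resp.status == expected_code


# Registry mapping a method name to its checker; adding a new check type
# means adding one function and one entry here.
CHECKERS: Dict[str, Callable[..., bool]] = {
    "classic": classic_check,
    "statuscode": statuscode_check,
}


@dataclass(frozen=True)
class HealthCheckSketch:
    """Hypothetical config object; not the real HealthCheckConfig."""
    port: str = "health"      # default keeps the existing port name
    method: str = "classic"   # default keeps the existing behavior
    args: Dict[str, Any] = field(default_factory=dict)

    def is_healthy(self, resp: Response) -> bool:
        try:
            checker = CHECKERS[self.method]
        except KeyError:
            raise ValueError("unknown health check method: %s" % self.method)
        # Per-method arguments are passed through as keyword arguments, so an
        # argument the checker does not accept raises TypeError.
        return checker(resp, **self.args)
```

With this layout the defaults reproduce the assumed current behavior, while
HealthCheckSketch(method="statuscode", args={"expected_code": 204}) selects
the status code variant with its own argument.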
