aurora-dev mailing list archives

From Bill Farner <wfar...@apache.org>
Subject Re: RFC HealthCheck
Date Sat, 21 Feb 2015 18:58:19 GMT
Aha, docs in the proto file confirm my read of the implementation:

>   // A health check for the task (currently in *alpha* and initial
>   // support will only be for TaskInfo's that have a CommandInfo).
>   optional HealthCheck health_check = 8;


-=Bill

On Sat, Feb 21, 2015 at 10:47 AM, Bill Farner <wfarner@apache.org> wrote:

> To answer OP:
>
> (1) seems perfectly reasonable, i don't foresee any pitfalls
>
> (2) seems reasonable as well.  Thrift unions help a bit here.  Just
> spitballing, but this general arrangement comes to mind:
>
> struct TcpCheck {
>   ...
> }
>
> struct HttpStatusCheck {
>   ...
> }
>
> struct HttpPayloadCheck {
>   ...
> }
>
> union HttpCheckCriteria {
>   1: HttpStatusCheck status
>   2: HttpPayloadCheck payload
> }
>
> struct HttpCheck {
>   ...
>   n: set<HttpCheckCriteria> criteria
> }
>
> union HealthCheck {
>   1: TcpCheck tcp
>   2: HttpCheck http
> }
>
>
> We could obviously get pretty complicated with this if we choose to, but
> starting with some opinionated defaults and an extensible structure may be
> key.
>
> I also agree that graceful teardown should be decoupled from health checks.
>
> -=Bill
>
> On Sat, Feb 21, 2015 at 10:35 AM, Bill Farner <wfarner@apache.org> wrote:
>
>> If i'm reading the code correctly, the only way to use mesos' health
>> checks is with the command executor?  Can somebody check my work on that?
>>
>> Some other context around health checks to keep in mind:
>> - there is a review [1] in-flight for the executor to delay the
>> transition to RUNNING until the first positive health check [2]
>> - we want to make the scheduler the authority for reacting to health
>> check failures [3].  this is a very real concern for large services to
>> avoid simultaneous failures
>>
>> [1] https://reviews.apache.org/r/31104/
>> [2] https://issues.apache.org/jira/browse/AURORA-894
>> [3] https://issues.apache.org/jira/browse/AURORA-279
>>
>>
>> -=Bill
>>
>> On Sat, Feb 21, 2015 at 3:48 AM, Erb, Stephan <
>> Stephan.Erb@blue-yonder.com> wrote:
>>
>>> Hi Florian,
>>>
>>> have you looked at what Mesos is already offering out of the box [1]?
>>> Maybe there is a way to implement your features by relying on Mesos
>>> directly, instead of making the Aurora implementation more flexible.
>>>
>>> As you've mentioned, the lifecycle endpoints abort and quit seem to be
>>> quite orthogonal to the health checking idea. I would be in favor of
>>> separating the different concepts. I even thought about this yesterday,
>>> because in our environment we only want health checking but now also have
>>> to pay a price of 10 secs additional latency when stopping jobs due to the
>>> graceful kill escalation.
>>>
>>> [1]
>>> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L141
>>>
>>>
>>> Regards,
>>> Stephan
>>>
>>> ________________________________________
>>> From: Florian Pfeiffer <florian.pfeiffer@gutefrage.net>
>>> Sent: Saturday, February 21, 2015 4:27 AM
>>> To: dev@aurora.incubator.apache.org
>>> Subject: RFC HealthCheck
>>>
>>> Hi,
>>>
>>> I would like to start working on the Healthchecker
>>>
>>> 1) Enable configuration of the port name on which to run health checks
>>> (this should also tackle AURORA-321)
>>> This seems like a very small change consisting of adding a new variable
>>> named „port“ to the HealthCheckConfig in base.py with a default value of
>>> „health“ to be backwards compatible. Any pitfalls? Any objections?
>>>
>>> 2) There’s at least one ticket in jira that’s about making the endpoints
>>> for the health check configurable. I would like to have a health check that
>>> works on HTTP Status Codes, and there might be other people that are fine
>>> with a health check that works on checking if it’s possible to make a TCP
>>> connection
>>>
>>> For my use case I would probably be fine if I add a variable „method“
>>> to the HealthCheckConfig, with a default value of „classic“ for the
>>> current behavior and something like „statuscode“ for a check
>>> that’s very similar to the current one in http_signaler.py but, instead
>>> of parsing the response, checks the status code (with the downside that the
>>> endpoints /health /abort /quitquitquit are still hardcoded)
>>>
>>> Any ideas how this can be a little bit more generic, so that if we have
>>> 3-5 different types of health checks we can have different arguments to
>>> each health check? (e.g. expected_response for the current one,
>>> expected_code for the status code checker, and maybe something
>>> like max_response_time for defining how fast traffic has to appear on a tcp
>>> connection check)
>>>
>>>
>>> A side question: for me it seems like /health and (/abort &
>>> /quitquitquit) are not very closely related. Does it make sense to have
>>> those 3 things grouped in the HealthCheck?
>>>
>>>
>>> Best,
>>> Florian
>>>
>>>
>>>
>>
>
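
For readers who think in Python rather than Thrift, the union arrangement Bill spitballs in the thread could be mirrored roughly as follows. This is only a sketch under stated assumptions: the thread elides every struct body with "...", so all field names here (expected_code, expected_response, timeout_secs, endpoint) are hypothetical, and treating the criteria set as "all must pass" is one possible reading, not something the thread specifies.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class HttpStatusCheck:
    # Hypothetical field: the HTTP status code that counts as healthy.
    expected_code: int = 200


@dataclass(frozen=True)
class HttpPayloadCheck:
    # Hypothetical field: the response body that counts as healthy.
    expected_response: str = "ok"


# Counterpart of `union HttpCheckCriteria`: exactly one of the two variants.
HttpCheckCriteria = Union[HttpStatusCheck, HttpPayloadCheck]


@dataclass(frozen=True)
class TcpCheck:
    # Hypothetical field: seconds allowed for the connection attempt.
    timeout_secs: float = 1.0


@dataclass(frozen=True)
class HttpCheck:
    # Hypothetical field: path to request on the health port.
    endpoint: str = "/health"
    # Counterpart of `set<HttpCheckCriteria> criteria`.
    criteria: FrozenSet[HttpCheckCriteria] = frozenset({HttpStatusCheck()})


# Counterpart of `union HealthCheck`.
HealthCheck = Union[TcpCheck, HttpCheck]


def criterion_passes(criterion: HttpCheckCriteria, status: int, body: str) -> bool:
    """Evaluate one criterion against an already-fetched HTTP response."""
    if isinstance(criterion, HttpStatusCheck):
        return status == criterion.expected_code
    if isinstance(criterion, HttpPayloadCheck):
        return body.strip().lower() == criterion.expected_response.lower()
    raise TypeError(f"unsupported criterion: {criterion!r}")


def http_check_passes(check: HttpCheck, status: int, body: str) -> bool:
    """Assumption: every criterion in the set must hold for the check to pass."""
    return all(criterion_passes(c, status, body) for c in check.criteria)
```

Adding a new kind of check then means adding one dataclass and one branch, which is the extensibility the union layout is after. Note that an empty criteria set passes trivially under this reading.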

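Florian's two proposals from the original RFC (a configurable port name defaulting to „health“, and a „method“ selector with per-method arguments) can be illustrated with a small dispatch table. This is a hedged sketch, not Aurora's actual HealthCheckConfig from base.py: the dataclass, the args mapping, the checker functions, and the portmap shape are all hypothetical stand-ins, and the "classic" check is assumed to compare the response body against an expected string.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Mapping


@dataclass(frozen=True)
class HealthCheckConfig:
    # Name of the port to check; "health" keeps the existing behavior.
    port: str = "health"
    # "classic" (compare the body) or "statuscode" (compare the HTTP status).
    method: str = "classic"
    # Per-method arguments, e.g. {"expected_code": 204} (hypothetical).
    args: Mapping[str, object] = field(default_factory=dict)


def classic_check(status: int, body: str, expected_response: str = "ok") -> bool:
    """Assumed current behavior: healthy if the body matches the expected text."""
    return body.strip().lower() == expected_response.lower()


def status_code_check(status: int, body: str, expected_code: int = 200) -> bool:
    """Proposed behavior: healthy if the status code matches."""
    return status == expected_code


# Registry mapping method names to checkers; new methods are added here.
CHECKERS: Dict[str, Callable[..., bool]] = {
    "classic": classic_check,
    "statuscode": status_code_check,
}


def resolve_port(config: HealthCheckConfig, portmap: Mapping[str, int]) -> int:
    """Look up the configured port name in the task's name-to-number map."""
    try:
        return portmap[config.port]
    except KeyError:
        raise ValueError(f"no port named {config.port!r}") from None


def evaluate(config: HealthCheckConfig, status: int, body: str) -> bool:
    """Dispatch to the configured checker, passing its method-specific args."""
    try:
        checker = CHECKERS[config.method]
    except KeyError:
        raise ValueError(f"unknown health check method: {config.method!r}") from None
    return checker(status, body, **config.args)
```

With no arguments, HealthCheckConfig() selects the port named "health" and the classic check, so existing job configurations would behave as before under these assumptions.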