mesos-dev mailing list archives

From "Niklas Quarfot Nielsen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-741) Add health checking for tasks
Date Wed, 14 May 2014 19:56:16 GMT

    [ https://issues.apache.org/jira/browse/MESOS-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997952#comment-13997952 ]

Niklas Quarfot Nielsen commented on MESOS-741:
----------------------------------------------

The first part of this feature has landed in https://reviews.apache.org/r/21052/
The next immediate step is to add an optional healthy field to the status update.

While discussing this, we realized that the initial approach, where we only notify the
framework when the task has been killed (due to being unhealthy), is most likely an oversimplification.
We need to be able to send a TASK_RUNNING (healthy) status update when the task is ready
for operation. That raises two points:

1) Having the build/src/mesos-health-check program start a libprocess process that checks health
with delay() and returns on failure does not provide enough information to support the scenario
above.
2) For tasks with highly variable startup times, a grace period (during which unhealthy checks
are tolerated) will most likely be more useful than an initial_delay. The health checker
should be able to send TASK_RUNNING (healthy) as soon as the task is discovered to be healthy.
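
To make point 2 concrete, here is a rough Python sketch of the grace-period idea. It is not the mesos-health-check implementation: the class name, fields, and the returned strings are all hypothetical, and it assumes that failures are only tolerated until either the grace period expires or the task has been seen healthy once.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GracePeriodChecker:
    """Hypothetical checker; the names here are illustrative, not Mesos APIs."""

    check: Callable[[], bool]  # returns True when the task looks healthy
    grace_period: float        # seconds during which failed checks are tolerated
    started_at: float = 0.0    # time at which the task was launched
    seen_healthy: bool = False

    def step(self, now: float) -> Optional[str]:
        """Run one check and return an update to report, or None."""
        if self.check():
            if not self.seen_healthy:
                # Report readiness immediately, without waiting out a fixed delay.
                self.seen_healthy = True
                return "TASK_RUNNING (healthy)"
            return None
        # Assumption: a failed check is ignored only while the task is still
        # starting up, i.e. inside the grace period and never seen healthy.
        if not self.seen_healthy and now - self.started_at < self.grace_period:
            return None
        return "unhealthy"
```

With a fixed initial_delay, a task that comes up early would not be reported as healthy until the delay expires. In this sketch the first successful check is reported right away, and failures only count once the grace period is over or the task has already been healthy.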

So I propose that we pass the PID of the executor to the health-checker program and send protobufs
back with the health updates.
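
As a rough illustration of this proposal (not actual Mesos code): the sketch below assumes the executor PID is handed to the health checker as a string of the form id@ip:port, and uses a small dataclass serialized as JSON purely as a stand-in for the protobuf. The HealthUpdate type, its fields, and both helper functions are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass
class HealthUpdate:
    """Hypothetical stand-in for the protobuf sent back to the executor."""

    task_id: str
    healthy: bool


def parse_pid(pid: str) -> Tuple[str, str, int]:
    """Split an 'id@ip:port' string into (id, ip, port)."""
    ident, _, address = pid.rpartition("@")
    host, _, port = address.rpartition(":")
    if not ident or not host or not port.isdigit():
        raise ValueError("expected id@ip:port, got %r" % pid)
    return ident, host, int(port)


def encode(update: HealthUpdate) -> bytes:
    """Serialize an update; JSON is used here only in place of a protobuf."""
    return json.dumps(asdict(update), sort_keys=True).encode("utf-8")
```

The point of the sketch is only that the checker needs two things: an address to report to, and a message that carries the health state per task instead of a bare exit code.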

Additionally, we need more flexible checks than HTTP alone; a good place to start
would be to support 'health check scripts' up front.
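
A minimal sketch of what a script-based check could look like, assuming that exit status 0 means healthy and that a timeout or a script that cannot be started counts as unhealthy. The function name and the timeout parameter are hypothetical, not an existing Mesos interface.

```python
import subprocess
import sys
from typing import List


def run_health_script(command: List[str], timeout_seconds: float = 10.0) -> bool:
    """Run a health check command; True means the task is considered healthy."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Assumption: a hung or missing script is treated as an unhealthy result.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Trivial stand-in for a real script: a Python one-liner that exits with 0.
    print(run_health_script([sys.executable, "-c", "import sys; sys.exit(0)"]))
```

An HTTP check then becomes just one possible script (for example, a curl invocation), which is why scripts look like a reasonable first building block.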

Thoughts and input?

> Add health checking for tasks
> -----------------------------
>
>                 Key: MESOS-741
>                 URL: https://issues.apache.org/jira/browse/MESOS-741
>             Project: Mesos
>          Issue Type: Story
>          Components: master, slave
>            Reporter: Niklas Quarfot Nielsen
>            Assignee: Niklas Quarfot Nielsen
>
> Determining the health of a task during its lifetime (during startup, while it is running,
> shutting down, etc.) can be considered a more elaborate matter than only observing its process
> state.
> The task health might be determined by any combination of observable behavior; for example,
> the process listening on a certain range of ports, writing certain files or pipes, responding
> to messages, using resources at or below certain thresholds, etc.
> It could be a powerful extension to extend the interface for launching and running tasks
> with an optional HealthCommand message. This message could encode:
> 1) A command to be run at the slave to determine the health of the task. The return value
> of the command will tell if the task is healthy or unhealthy.
> 2) An interval at which to run the health command.
> In connection with this, it could make sense to introduce new healthy and unhealthy task
> states.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
