mesos-issues mailing list archives

From "Jie Yu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-2246) Improve slave health-checking
Date Mon, 26 Jan 2015 22:34:34 GMT

    [ https://issues.apache.org/jira/browse/MESOS-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292534#comment-14292534 ]

Jie Yu commented on MESOS-2246:
-------------------------------

We probably wanna separate this work into two major pieces:

1) Reduce false positives (cases where we think the slave is dead but it is actually alive)
as much as possible. The current health check is based on a series of ping-pongs between the
leading master and the slaves, plus a magic timeout value. While simple, this might not be the
best way to detect dead slaves (in terms of false positives). There is existing research that
tries to solve this problem (e.g., gossip protocols).
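
To make the false-positive problem concrete, here is a minimal sketch of a ping-pong check with a
fixed timeout. It is not the Mesos implementation: the class name, method name, and default values
are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PingHealthChecker:
    """Toy ping/pong failure detector (hypothetical, not Mesos code)."""

    ping_timeout: float = 15.0   # seconds allowed for each pong (assumed value)
    max_missed_pings: int = 5    # consecutive misses before removal (assumed value)
    missed: Dict[str, int] = field(default_factory=dict)

    def record_ping(self, slave_id: str, pong_latency: Optional[float]) -> bool:
        """Record one ping round and return True if the slave is now considered dead.

        pong_latency is the time the pong took to arrive, or None if it never arrived.
        """
        on_time = pong_latency is not None and pong_latency <= self.ping_timeout
        if on_time:
            # Any timely pong resets the consecutive-miss counter.
            self.missed[slave_id] = 0
        else:
            # A late pong is indistinguishable from a lost one for this detector.
            self.missed[slave_id] = self.missed.get(slave_id, 0) + 1
        return self.missed[slave_id] >= self.max_missed_pings
```

A slave whose pongs are merely delayed (for example by a congested or partitioned network) is
counted the same way as a slave that is really gone, which is where the false positives come from.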

2) We need to admit that false positives are inevitable, so the question is how we are going
to handle them. Currently, Mesos handles a false positive by killing all tasks and removing
the slave. We could improve this part by allowing smarter decisions. Some possible ways are:
let frameworks make those decisions (e.g., be SLA-aware), or introduce a few policies for
frameworks to choose from, etc.
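
As a rough illustration of the "few policies" idea, the sketch below lets each framework pick how
its tasks are treated when a slave is suspected dead. Everything here is an assumption: the policy
names, the function, and the data shapes are hypothetical and do not correspond to any Mesos API.

```python
from enum import Enum
from typing import Dict, List


class PartitionPolicy(Enum):
    """Hypothetical per-framework policies for a slave that looks dead."""

    KILL_AND_REMOVE = "kill_and_remove"  # the behaviour described as current
    KEEP_RUNNING = "keep_running"        # tolerate a possible false positive


def tasks_to_kill(
    slave_tasks: Dict[str, str],
    framework_policies: Dict[str, PartitionPolicy],
    default: PartitionPolicy = PartitionPolicy.KILL_AND_REMOVE,
) -> List[str]:
    """Return the task ids to kill on a slave that failed its health check.

    slave_tasks maps task id -> framework id; frameworks with no declared
    policy fall back to the default.
    """
    doomed = []
    for task_id, framework_id in slave_tasks.items():
        policy = framework_policies.get(framework_id, default)
        if policy is PartitionPolicy.KILL_AND_REMOVE:
            doomed.append(task_id)
    return sorted(doomed)
```

With the default left at KILL_AND_REMOVE, frameworks that declare nothing keep today's behaviour,
so the change would be opt-in.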

> Improve slave health-checking
> -----------------------------
>
>                 Key: MESOS-2246
>                 URL: https://issues.apache.org/jira/browse/MESOS-2246
>             Project: Mesos
>          Issue Type: Epic
>          Components: master, slave
>            Reporter: Dominic Hamon
>            Assignee: Jie Yu
>
> In the event of a network partition, or other systemic issues, we may see widespread
> slave removal. There are several approaches we can take to mitigate this issue including,
> but not limited to:
> . rate limit the slave removal
> . change how we do health checking to not rely on a single point of view
> . work with frameworks to determine SLA of running services before removing the slave
> . manual control to allow operator intervention 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
