mesos-dev mailing list archives

From "Tobias Weingartner (JIRA)" <>
Subject [jira] [Commented] (MESOS-1529) Handle a network partition between Master and Slave
Date Wed, 25 Jun 2014 18:14:25 GMT


Tobias Weingartner commented on MESOS-1529:

2) What does an "exit" event signify?  Why would we need to check that it was for a leading

3) How was the 75-second timeout determined?  Does it lock us into a phased upgrade path if
the timeout value ever needs to change?  If we get a ping from a non-leading master, we should
likely ignore it rather than immediately trigger re-registration, i.e. let the timeout take effect.
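
To make point 3 concrete, here is a rough sketch of the behaviour I have in mind. It is
illustrative Python, not Mesos code: PingMonitor, on_ping, should_reregister and the master
names are all hypothetical, and the 75-second default is simply the figure under discussion,
assumed here to be configurable.

```python
from dataclasses import dataclass


@dataclass
class PingMonitor:
    """Hypothetical slave-side tracker for master pings (not actual Mesos code)."""

    leading_master: str
    timeout_secs: float = 75.0  # figure from the discussion; assumed configurable
    last_ping: float = 0.0      # time of the last ping accepted from the leader

    def on_ping(self, sender: str, now: float) -> bool:
        """Record a ping; return True only if it came from the leading master."""
        if sender != self.leading_master:
            # Ignore it: do not reset the timer and do not re-register.
            return False
        self.last_ping = now
        return True

    def should_reregister(self, now: float) -> bool:
        """True once no leading-master ping has arrived within the timeout."""
        return now - self.last_ping > self.timeout_secs


monitor = PingMonitor(leading_master="master-a")
monitor.on_ping("master-a", now=10.0)  # accepted, timer reset to t=10
monitor.on_ping("master-b", now=50.0)  # non-leading master: ignored, timer unchanged
print(monitor.should_reregister(now=80.0))  # 70 s since the last accepted ping
print(monitor.should_reregister(now=90.0))  # 80 s since the last accepted ping
```

The point of the sketch is that a stray ping from a non-leading master neither resets the
timer nor triggers re-registration; only the timeout expiring does.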

> Handle a network partition between Master and Slave
> ---------------------------------------------------
>                 Key: MESOS-1529
>                 URL:
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Dominic Hamon
> If a network partition occurs between a Master and Slave, the Master will remove the
> Slave (as it fails health check) and mark the tasks being run there as LOST. However, the
> Slave is not aware that it has been removed so the tasks will continue to run.
> (To clarify a little bit: neither the master nor the slave receives 'exited' event, indicating
> that the connection between the master and slave is not closed).
> There are at least two possible approaches to solving this issue:
> 1. Introduce a health check from Slave to Master so they have a consistent view of a
> network partition. We may still see this issue should a one-way connection error occur.
> 2. Be less aggressive about marking tasks and Slaves as lost. Wait until the Slave reappears
> and reconcile then. We'd still need to mark Slaves and tasks as potentially lost (zombie state)
> but maybe the Scheduler can make a more intelligent decision.

