mesos-user mailing list archives

From Benjamin Mahler <>
Subject Re: Messaging reliability in Mesos
Date Fri, 06 Sep 2013 19:10:53 GMT
I created so that we can
document this for framework developers. In 0.15.0, lost updates will be
fairly minimal so we consider documenting any known cases where these can

On Thu, Sep 5, 2013 at 3:20 PM, Vinod Kone <> wrote:

> tl;dr: If the master fails over when a slave fails, there is a (small)
> chance that status updates of that slave are not reliably sent to the
> scheduler.
> In earlier versions of Mesos (pre-0.14.0), when the master failed over
> at the same time as a slave failure, pending status updates of that slave
> were not sent to the scheduler.
> In 0.14.0, we are introducing a new feature called "Slave Recovery" where
> slaves checkpoint status update information to disk. This increases the
> reliability of status updates. But there are still some cases where updates
> are not reliably retried (e.g., the master fails over when the slave fails
> but the slave never comes back up).
> In 0.15.0, we plan to introduce another feature called "Registrar" where
> masters checkpoint slave info to durable storage. This reduces the
> probability of lost updates even further. Unfortunately, even this wouldn't
> give a 100% reliability guarantee on the delivery of status updates.

In this case, we will be able to send slaveLost which is strictly better
than sending nothing. Schedulers could act on this signal in 0.15.0. The
caveat here is that we will have to consider making the sending of
slaveLost reliable; otherwise, if the scheduler is failing over, the
slaveLost message will be dropped.
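
To illustrate how a scheduler could act on slaveLost, here is a minimal
sketch in Python. It is bookkeeping only and does not use the Mesos API:
the TaskTracker class and its method names are hypothetical, and it assumes
the framework calls them from its own launch, status update, and slaveLost
handlers. The task state names are the usual Mesos terminal states.

```python
from collections import defaultdict

# Terminal task states; a task in one of these needs no further tracking.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


class TaskTracker:
    """Hypothetical scheduler-side bookkeeping; not part of the Mesos API."""

    def __init__(self):
        # slave_id -> set of task_ids we believe are still running there
        self._tasks_by_slave = defaultdict(set)
        # task_ids we gave up on because their slave was reported lost
        self.lost_tasks = []

    def task_launched(self, slave_id, task_id):
        self._tasks_by_slave[slave_id].add(task_id)

    def status_update(self, slave_id, task_id, state):
        # A terminal update means the task no longer needs tracking.
        if state in TERMINAL_STATES:
            self._tasks_by_slave[slave_id].discard(task_id)

    def slave_lost(self, slave_id):
        # Terminal updates for these tasks may never arrive, so treat
        # every task still tracked on this slave as lost.
        lost = sorted(self._tasks_by_slave.pop(slave_id, set()))
        self.lost_tasks.extend(lost)
        return lost


tracker = TaskTracker()
tracker.task_launched("slave-1", "task-a")
tracker.task_launched("slave-1", "task-b")
tracker.status_update("slave-1", "task-a", "TASK_FINISHED")
print(tracker.slave_lost("slave-1"))  # prints ['task-b']
```

This only helps when the slaveLost message actually reaches the scheduler,
which is the caveat above.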

> On Thu, Sep 5, 2013 at 2:54 PM, Li Jin <> wrote:
>> Hi Mesosers,
>> I am wondering how reliable messaging is in Mesos. I didn't find any
>> documentation about it. For instance, are the schedulers guaranteed to
>> receive task status no matter what? Even when a TASK_FINISHED message is
>> sent when the master is failing over? There are probably too many failure
>> cases so I just want to have a general idea.
>> Thanks,
>> Li
