mesos-user mailing list archives

From Sharma Podila <>
Subject Re: What happens if a scheduler registers with a framework ID that hasn't been used in 48 hours?
Date Mon, 21 Apr 2014 22:10:05 GMT
On a related note, what if the framework scheduler is up while the Mesos
master goes down? If the master then restarts after an interval longer
than the framework failover timeout, what is the expected behavior? Would the
framework get a reregistered() callback, or an error() callback?

On Fri, Apr 18, 2014 at 10:54 AM, Vinod Kone <> wrote:

> I think you are on the right track here.
> I would recommend setting a high failover timeout that is an upper bound
> for all of your schedulers being down (e.g., 1 week). This way, even if all
> your scheduler instances are down due to outage/maintenance, your
> tasks/services keep running in the Mesos cluster.
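
A minimal sketch of the "1 week" suggestion above, assuming the failover timeout is supplied in seconds through the framework's FrameworkInfo. The dictionary below is only a hypothetical stand-in for that message, and the framework name is made up for illustration.

```python
from datetime import timedelta

# Assumed upper bound on how long every scheduler instance could be down.
MAX_SCHEDULER_OUTAGE = timedelta(weeks=1)


def failover_timeout_seconds(max_outage: timedelta) -> float:
    """Convert the outage bound to seconds for the failover timeout field."""
    return max_outage.total_seconds()


# Hypothetical stand-in for a FrameworkInfo message; only the
# failover_timeout field matters for this discussion.
framework_info = {
    "name": "example-framework",
    "failover_timeout": failover_timeout_seconds(MAX_SCHEDULER_OUTAGE),
}
```
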
> On Fri, Apr 18, 2014 at 5:02 AM, David Greenberg <> wrote:
>> Hey Vinod,
>> The problem I'm trying to solve is writing a framework that can run on
>> our HA application cluster, so that whenever the framework's current
>> scheduler dies, another node is elected and takes over. I'm trying to work
>> through the various failure cases to understand how to implement this so
>> that it handles every one I can think of.
>> It sounds like the solution that'd work best for me would be to try to
>> read the framework ID from a known location and register with that. If it's
>> not there, or if registration fails, then the framework should register
>> anew.
>> This framework's state is very large and resides in a couple of databases,
>> so even if the entire set of candidates for becoming the framework is
>> down for the whole failover grace period, the framework still wants to
>> register, since its state never gets invalidated.
>> Thanks,
>> David
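
The approach David describes above (read the framework ID from a known location, register with it, and register anew if it is missing or registration fails) could be sketched roughly as follows. Every name here is hypothetical: `load_id` and `save_id` stand in for whatever store holds the ID (ZK, for instance), and `register` stands in for a wrapper around the scheduler driver that returns the ID the master assigned or raises `RegistrationError` on rejection. None of these are Mesos APIs.

```python
from typing import Callable, Optional


class RegistrationError(Exception):
    """Hypothetical error raised when the master rejects a registration."""


def register_with_fallback(
    load_id: Callable[[], Optional[str]],
    save_id: Callable[[str], None],
    register: Callable[[Optional[str]], str],
) -> str:
    """Try the stored framework ID first; register anew if that is not possible.

    register(None) is assumed to request a brand-new framework ID.
    """
    stored = load_id()
    if stored is not None:
        try:
            # Re-register under the previously assigned ID.
            return register(stored)
        except RegistrationError:
            # The stored ID was rejected; fall through and start fresh.
            pass
    new_id = register(None)
    # Persist the new ID so the next elected scheduler can find it.
    save_id(new_id)
    return new_id
```

Passing the storage and registration steps in as callables keeps the fallback logic testable without a running master.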
>> On Thursday, April 17, 2014, Vinod Kone <> wrote:
>>> On Thu, Apr 17, 2014 at 2:56 PM, David Greenberg <> wrote:
>>>> My follow-up question is this--is there a way to tell whether I'm
>>>> outside of the timeout window? I'd like to have my framework check ZK and
>>>> determine whether it's w/in the framework timeout or not, so that it can
>>>> make the correct call.
>>> Hey David,
>>> Currently, the only signal you can get is by hitting the "/state.json"
>>> endpoint on the master. The framework should have been moved to
>>> 'completed_frameworks' after the failover timeout. Of course, if the
>>> master fails over, this information is lost, so you can't reliably
>>> depend on it.
>>> When the master starts storing persistent state about frameworks (likely
>>> a couple of releases away), a re-registration attempt in such a case would
>>> be denied by the master. So that could be your signal. Alternatively, with
>>> persistence, you could also more reliably depend on "/state.json" to get
>>> this info.
>>> To take a step back, what is the problem you are trying to solve?
>>> Thanks,
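
As a rough illustration of the "/state.json" check Vinod describes, the sketch below classifies a framework ID against an already-fetched payload. It assumes the payload has top-level "frameworks" and "completed_frameworks" lists whose entries carry an "id" field; only 'completed_frameworks' is named in the thread, so treat the rest of the shape as an assumption. The sample IDs are invented.

```python
import json


def framework_status(state: dict, framework_id: str) -> str:
    """Classify a framework ID using a parsed /state.json payload.

    Returns "active", "completed", or "unknown". "unknown" covers the case
    where the master has failed over and no longer remembers the framework.
    """
    active = {fw.get("id") for fw in state.get("frameworks", [])}
    completed = {fw.get("id") for fw in state.get("completed_frameworks", [])}
    if framework_id in active:
        return "active"
    if framework_id in completed:
        return "completed"
    return "unknown"


# Invented sample payload, standing in for the master's response body.
sample = json.loads("""
{
  "frameworks": [{"id": "fw-live"}],
  "completed_frameworks": [{"id": "fw-expired"}]
}
""")

status = framework_status(sample, "fw-expired")
```

A "completed" result suggests the failover timeout has already passed, while "unknown" cannot be interpreted reliably, for the reason given above: a master failover loses this information.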
