mesos-dev mailing list archives

From Evers Benno <ben...@yandex-team.ru>
Subject Re: Registering and framework failover
Date Wed, 13 Jul 2016 13:27:50 GMT
Let me try to clarify:

The problem is that I don't get to decide manually if the framework
should try to take a new id or re-use the old one, but it needs to be
decided programmatically, by an algorithm.

Afaik it's not possible to get the time when the framework disconnected
from mesos, so it's not possible to know how much time is left until the
failover timeout runs out. Therefore, if I want to attempt task
reconciliation, I just have to try registering with my old framework id
and see what happens.

However, in the case where the failover timeout already passed, I now
need to programmatically detect this error and try again with an empty
framework id.
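
A minimal sketch of the fallback described above, in Python. Everything
here is hypothetical and not part of any Mesos API: `register` stands in
for whatever call performs the registration, and `RegistrationError`
stands in for the error() callback. Matching on the text "Framework has
been removed" is an assumption based only on the message quoted in this
thread, not a documented contract.

```python
from typing import Callable, Optional, Tuple

# Error text as quoted in this thread; relying on it is an assumption.
REMOVED_MESSAGE = "Framework has been removed"


class RegistrationError(Exception):
    """Hypothetical stand-in for the scheduler's error() callback."""


def register_with_fallback(
    register: Callable[[Optional[str]], str],
    old_id: Optional[str],
) -> Tuple[str, bool]:
    """Try the old framework id first, then fall back to an empty one.

    `register` is a hypothetical callable that takes a framework id
    (or None for "empty") and returns the id the master assigned.
    Returns (framework_id, reused_old_id).
    """
    if old_id:
        try:
            # Success here means the failover timeout has not expired,
            # so task reconciliation can be attempted.
            return register(old_id), True
        except RegistrationError as err:
            # Only fall back when the old id was removed; re-raise
            # any other registration error unchanged.
            if REMOVED_MESSAGE not in str(err):
                raise
    # No old id, or it was removed: start over with an empty id.
    return register(None), False
```

The returned flag tells the caller whether reconciliation of old tasks
makes sense (old id accepted) or whether it is starting from scratch.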

My question was, is it possible to do this?

(Also, we actually use a failover timeout of 1 week, but that doesn't
really change the problem; I mistakenly assumed that an example with
smaller values would be more intuitive.)

On 13.07.2016 14:50, Neil Conway wrote:
> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <bennoe@yandex-team.ru> wrote:
>> imagine the following situation: I am a framework with failover timeout
>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
>> register with the master again.
>>
>> If my registration attempt arrives at the master within the time limit
>> everything will be fine and I even get back the old tasks for
>> reconciliation, but if it arrives slightly later the framework id is
>> permanently blocked by mesos, and I am not able to register. Instead, I
>> will receive an error()-callback with the message "Framework has been
>> removed".
> 
> Right: if you set a failover_timeout of 1 hour, your framework is
> expected to reregister within one hour. If it does not, all of its
> tasks will be killed and you need to start over with a new
> FrameworkID. Can you clarify which aspect of this behavior is
> problematic for you?
> 
> Note that a failover_timeout of 1 hour is probably a little low.
> 
>> Is there any way to reliably connect to the master while also
>> reconciling old tasks if possible?
> 
> Sorry, not sure what you mean by this.
> 
> Neil
> 
