mesos-dev mailing list archives

From Evers Benno <ben...@yandex-team.ru>
Subject Re: Registering and framework failover
Date Thu, 14 Jul 2016 10:32:00 GMT
So, given that this probably won't be changed before the 1.0 release,
are the strings considered part of the stable API? Or is it recommended
not to rely on `error()` at all? (That's what we did for now, setting
the failover timeout to 5 years.)

On 13.07.2016 15:37, Neil Conway wrote:
> Ah, right -- yes, at the moment you need to look at error strings to
> decide whether to retry with a new framework ID, unfortunately. IMO we
> should introduce error codes or enums to make this process more
> reliable, but no one has done so yet:
> 
> https://issues.apache.org/jira/browse/MESOS-4548
> https://issues.apache.org/jira/browse/MESOS-5322
> 
> Neil
> 
> 
> On Wed, Jul 13, 2016 at 3:27 PM, Evers Benno <bennoe@yandex-team.ru> wrote:
>> Let me try to clarify:
>>
> >> The problem is that I don't get to decide manually if the framework
>> should try to take a new id or re-use the old one, but it needs to be
>> decided programmatically, by an algorithm.
>>
> >> AFAIK it's not possible to get the time when the framework disconnected
> >> from Mesos, so it's not possible to know how much time is left until the
>> failover timeout runs out. Therefore, if I want to attempt task
>> reconciliation, I just have to try registering with my old framework id
>> and see what happens.
>>
>> However, in the case where the failover timeout already passed, I now
>> need to programmatically detect this error and try again with an empty
>> framework id.
>>
>> My question was, is it possible to do this?
>>
>> (also, we actually use a failover timeout of 1 week, but it doesn't
>> really change the problem and I mistakenly assumed that an example with
>> smaller values would be more intuitive)
>>
>> On 13.07.2016 14:50, Neil Conway wrote:
>>> On Wed, Jul 13, 2016 at 2:44 PM, Evers Benno <bennoe@yandex-team.ru> wrote:
> >>>> Imagine the following situation: I am a framework with failover timeout
>>>> of 1 hour, and 59 minutes and 55 seconds after shutting down I want to
>>>> register with the master again.
>>>>
>>>> If my registration attempt arrives at the master within the time limit
>>>> everything will be fine and I even get back the old tasks for
>>>> reconciliation, but if it arrives slightly later the framework id is
> >>>> permanently blocked by Mesos, and I am not able to register. Instead, I
>>>> will receive an error()-callback with the message "Framework has been
>>>> removed".
>>>
>>> Right: if you set a failover_timeout of 1 hour, your framework is
>>> expected to reregister within one hour. If it does not, all of its
>>> tasks will be killed and you need to start over with a new
>>> FrameworkID. Can you clarify which aspect of this behavior is
>>> problematic for you?
>>>
>>> Note that a failover_timeout of 1 hour is probably a little low.
>>>
>>>> Is there any way to reliably connect to the master while also
>>>> reconciling old tasks if possible?
>>>
>>> Sorry, not sure what you mean by this.
>>>
>>> Neil
>>>
