mesos-user mailing list archives

From Philippe Laflamme <phili...@hopper.com>
Subject Re: Mesos Slave Port Change Fails Recovery
Date Fri, 03 Jul 2015 03:45:40 GMT
Checkpointing has been enabled since 0.18 on these slaves. The only other
setting that changed during the upgrade was that we added --gc_delay=1days.
Otherwise, it's an in-place upgrade without any changes to the work
directory...

Philippe

On Thu, Jul 2, 2015 at 8:59 PM, Vinod Kone <vinodkone@gmail.com> wrote:

> It is surprising that the slave didn't bail out during the initial phase
> of recovery when the port changed. I'm assuming you enabled checkpointing
> in 0.20.0 and that you didn't wipe the metadata directory or anything when
> upgrading to 0.21.0?
>
> On Thu, Jul 2, 2015 at 3:06 PM, Philippe Laflamme <philippe@hopper.com>
> wrote:
>
>> Here you are:
>>
>> https://gist.github.com/plaflamme/9cd056dc959e0597fb1c
>>
>> You can see in the mesos-master.INFO log that it re-registers the slave
>> using port :5050 (line 9) and fails the health checks on port :5051 (line
>> 10). So it might be the slave that re-uses the old configuration?
>>
>> Thanks,
>> Philippe
>>
>> On Thu, Jul 2, 2015 at 5:54 PM, Vinod Kone <vinodkone@gmail.com> wrote:
>>
>>> Can you paste some logs?
>>>
>>> On Thu, Jul 2, 2015 at 2:23 PM, Philippe Laflamme <philippe@hopper.com>
>>> wrote:
>>>
>>>> Ok, that's reasonable, but I'm not sure why it would successfully
>>>> re-register with the master if it's not supposed to in the first place. I
>>>> think changing the resources (for example) will dump the old configuration
>>>> in the logs and tell you why recovery is bailing out. It's not doing that
>>>> in this case.
>>>>
>>>> It looks as though this doesn't work only because the master can't ping
>>>> the slave on the old port, because the whole recovery process was
>>>> successful otherwise.
>>>>
>>>> I'm not sure if the slave could have picked up its configuration change
>>>> and failed the recovery early, but that would definitely be a better
>>>> experience.
>>>>
>>>> Philippe
>>>>
>>>> On Thu, Jul 2, 2015 at 5:15 PM, Vinod Kone <vinodkone@gmail.com> wrote:
>>>>
>>>>> For slave recovery to work, the slave is expected not to change its config.
>>>>>
>>>>> On Thu, Jul 2, 2015 at 2:10 PM, Philippe Laflamme <philippe@hopper.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to roll out an upgrade from 0.20.0 to 0.21.0 with slaves
>>>>>> configured with checkpointing and with "reconnect" recovery.
>>>>>>
>>>>>> I was investigating why the slaves would successfully re-register
>>>>>> with the master and recover, but would subsequently be asked to shut down
>>>>>> ("health check timeout").
>>>>>>
>>>>>> It turns out that our slaves had been unintentionally configured to
>>>>>> use port 5050 in the previous configuration. We decided to fix that
>>>>>> during the upgrade and have them use the default 5051 port.
>>>>>>
>>>>>> This change seems to make the health checks fail and eventually kills
>>>>>> the slave due to inactivity.
>>>>>>
>>>>>> I've confirmed that leaving the port as it was in the previous
>>>>>> configuration lets the slave re-register successfully, and it is not
>>>>>> asked to shut down later on.
>>>>>>
>>>>>> Is this a known issue? I haven't been able to find a JIRA ticket for
>>>>>> this. Maybe it's the expected behaviour? Should I create a ticket?
>>>>>>
>>>>>> Thanks,
>>>>>> Philippe
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
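
The early failure suggested earlier in the thread (the slave noticing that its
configuration changed and refusing to recover, instead of re-registering on
the old port) could look roughly like the sketch below. This is not Mesos
code: SlaveConfig, IncompatibleConfigError and check_recovery_compatible are
hypothetical names, and which settings a slave actually checkpoints is an
assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaveConfig:
    """Hypothetical subset of the settings a slave might checkpoint."""
    hostname: str
    port: int


class IncompatibleConfigError(Exception):
    """Raised when the current config differs from the checkpointed one."""


def check_recovery_compatible(checkpointed: SlaveConfig, current: SlaveConfig) -> None:
    """Fail early, naming every compared field that changed since the checkpoint."""
    changed = [
        f"{name}: {getattr(checkpointed, name)!r} -> {getattr(current, name)!r}"
        for name in ("hostname", "port")
        if getattr(checkpointed, name) != getattr(current, name)
    ]
    if changed:
        raise IncompatibleConfigError(
            "checkpointed config differs from current config ("
            + "; ".join(changed)
            + ")"
        )
```

With a check like this, a slave checkpointed on port 5050 and restarted on
port 5051 would raise at startup and report the port change, rather than
re-registering and later failing the master's health checks.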
