mesos-issues mailing list archives

From "Yan Xu (JIRA)" <>
Subject [jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot
Date Thu, 16 Feb 2017 22:55:41 GMT


Yan Xu commented on MESOS-6223:

From my comment on the email thread:

One thing that was brought up during offline conversations is what happens if the host
reboot is associated with a hardware change (e.g., a new memory stick):

Currently: the agent skips recovery (and so never gets the chance to run into incompatible
agent info) and registers as a new agent.
With the change: the agent could run into incompatible agent info because its resources
changed, and would flap indefinitely until the operator intervenes.

To mitigate this and maintain the current behavior, we can have the agent remove
`<work_dir>/meta/slaves/latest` (the equivalent of `rm -f <work_dir>/meta/slaves/latest`)
automatically upon recovery failure, but only after the host has rebooted. This way the agent
can restart as a new agent without operator intervention.
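
As a rough illustration of that mitigation, here is a minimal Python sketch. It is not the
actual agent code (the agent is written in C++). The function name, its parameters, and the
idea of detecting a reboot by comparing a saved boot ID against the current one are all
assumptions made for illustration; only the `<work_dir>/meta/slaves/latest` path comes from
the proposal itself.

```python
import os


def cleanup_after_failed_recovery(work_dir, recovery_failed,
                                  saved_boot_id, current_boot_id):
    """Hypothetical helper: drop the 'latest' link so the agent starts fresh.

    Returns True if the agent should come up as a new agent, False if the
    checkpointed state was left untouched.
    """
    # Assumption: a reboot is detected by a change in boot ID.
    host_rebooted = saved_boot_id != current_boot_id

    # Only act when recovery failed AND the host has rebooted; a recovery
    # failure without a reboot still needs the operator to step in.
    if not (recovery_failed and host_rebooted):
        return False

    latest = os.path.join(work_dir, "meta", "slaves", "latest")
    try:
        # Removes the link itself, not the directory it points to,
        # mirroring `rm -f <work_dir>/meta/slaves/latest`.
        os.remove(latest)
    except FileNotFoundError:
        # Like `rm -f`, a missing file is not an error.
        pass
    return True
```

Gating on both conditions is what keeps the current behavior: a recovery failure without a
reboot still surfaces to the operator, while a failure after a reboot falls back to
registering as a new agent.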

Of course, even if we do this to maintain the current behavior, it remains true that relying
on a reboot as a signal for hardware change is not reliable; the proper fix for that should be MESOS-1739.

> Allow agents to re-register post a host reboot
> ----------------------------------------------
>                 Key: MESOS-6223
>                 URL:
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent
>            Reporter: Megha Sharma
>            Assignee: Megha Sharma
> The agent doesn't recover its state after a host reboot; instead it registers with the master and
gets a new SlaveID. With partition awareness, agents are now allowed to re-register after
they have been marked Unreachable. The executors are terminated on the agent anyway when it
reboots, so there is no harm in letting the agent keep its SlaveID, re-register with the master,
and reconcile the lost executors. This is a prerequisite for supporting persistent/restartable
tasks in Mesos (MESOS-3545).

This message was sent by Atlassian JIRA
