mesos-dev mailing list archives

From Yan Xu
Subject Re: MESOS-6233 Allow agents to re-register post a host reboot
Date Tue, 29 Nov 2016 02:09:44 GMT
So one thing that was brought up during offline conversations was that if
the host reboot is associated with hardware change (e.g., a new memory

   - Currently: the agent skips the recovery (and with it the chance of
   running into incompatible agent info) and registers as a new agent.
   - With the change: the agent could run into incompatible agent info due
   to the resource change and flap indefinitely until the operator
   intervenes.

To mitigate this and maintain the current behavior, we can have the agent
run `rm -f <work_dir>/meta/slaves/latest` automatically upon recovery
failure, but only if the host has rebooted. This way the agent can
restart as a new agent without operator intervention.
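
A rough sketch of that mitigation, in Python rather than the agent's C++,
just to make the control flow concrete. The function names, the boot ID
comparison as the reboot check, and the return values are all hypothetical
and not taken from the Mesos code base; only the
`<work_dir>/meta/slaves/latest` path comes from the proposal above.

```python
import os


def host_rebooted(checkpointed_boot_id: str, current_boot_id: str) -> bool:
    """Assumption: a changed boot ID means the host rebooted."""
    return checkpointed_boot_id.strip() != current_boot_id.strip()


def on_recovery_failure(work_dir: str, rebooted: bool) -> bool:
    """Hypothetical hook called when agent recovery fails.

    Returns True if the agent may restart as a new agent, or False if
    the `latest` symlink was left alone and the operator must intervene.
    """
    if not rebooted:
        # No reboot: keep today's behavior and leave the state untouched.
        return False

    latest = os.path.join(work_dir, "meta", "slaves", "latest")
    try:
        # Same effect as `rm -f`: removes the symlink itself, not its
        # target, and does not complain if the link is already gone.
        os.remove(latest)
    except FileNotFoundError:
        pass
    return True
```

The key point is the guard on `rebooted`: a recovery failure without a
reboot still requires the operator, so the existing safety check is not
weakened in that case.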

Any thoughts?

BTW this speaks to the need for MESOS-1739.


On Tue, Nov 15, 2016 at 7:37 AM, Megha Sharma wrote:

> Hi All,
> We have been working on the design for Restartable tasks (
> MESOS-3545) and allowing agents to recover and re-register post reboot is a
> pre-requisite for that.
> Today the agent doesn’t recover its state, which includes its SlaveID,
> after a host reboot; it short-circuits the recovery upon discovering the
> reboot and registers with the master as a new agent. With Partition
> Awareness, the Mesos master even allows agents that have failed the
> master’s health check pings (unreachable agents) to re-register with it
> and reconcile the tasks/executors. The executors on a rebooted host are
> terminated anyway, so there is no harm in letting such an agent recover
> and re-register with the master using its old SlaveID.
> Would like to hear from the folks here if you see any operational concerns
> with letting the agents recover post a host reboot.
> Many Thanks
> Megha Sharma
