mesos-dev mailing list archives

From Yan Xu <xuj...@apple.com>
Subject Re: MESOS-6233 Allow agents to re-register post a host reboot
Date Tue, 29 Nov 2016 02:09:44 GMT
One thing that was brought up during offline conversations is what happens
if the host reboot is associated with a hardware change (e.g., a new memory
stick):


   - Currently: the agent skips recovery (and with it the chance of
   running into incompatible agent info) and registers as a new agent.
   - With the change: the agent could run into incompatible agent info due
   to the resource change and flap
   <https://github.com/apache/mesos/blob/58f63747f185995d7f9cbfca9d240e2d60053184/src/slave/slave.cpp#L5280>
   indefinitely until the operator intervenes.


To mitigate this and maintain the current behavior, we can have the agent
remove `<work_dir>/meta/slaves/latest` automatically (the equivalent of
`rm -f <work_dir>/meta/slaves/latest`) upon recovery failure, but only if
the host has rebooted. This way the agent can restart as a new agent
without operator intervention.
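
A minimal sketch of that mitigation, written in Python rather than the
agent's C++ purely for illustration. It assumes the reboot is detected by
comparing a checkpointed boot ID with the current one; the function names
and the boot ID arguments are hypothetical and are not taken from the Mesos
code base.

```python
import os
import tempfile


def latest_path(work_dir):
    # Path from the proposal: <work_dir>/meta/slaves/latest
    return os.path.join(work_dir, "meta", "slaves", "latest")


def host_rebooted(checkpointed_boot_id, current_boot_id):
    # Assumption: a reboot shows up as a change in the boot ID.
    return checkpointed_boot_id.strip() != current_boot_id.strip()


def on_recovery_failure(work_dir, checkpointed_boot_id, current_boot_id):
    """Called only after recovery has already failed.

    Returns True if the agent may restart as a new agent, i.e. the host
    has rebooted and 'latest' is no longer present. Returns False if the
    host has not rebooted, in which case nothing is touched and the
    operator still has to intervene.
    """
    if not host_rebooted(checkpointed_boot_id, current_boot_id):
        return False

    path = latest_path(work_dir)
    # Like `rm -f`: remove the link itself, and do not fail if it is absent.
    if os.path.lexists(path):
        os.unlink(path)
    return True


if __name__ == "__main__":
    # Build a throwaway work_dir with a 'latest' link to try this out.
    demo_dir = tempfile.mkdtemp()
    demo_target = os.path.join(demo_dir, "meta", "slaves", "agent-1")
    os.makedirs(demo_target)
    os.symlink(demo_target, latest_path(demo_dir))
    print(on_recovery_failure(demo_dir, "boot-a", "boot-b"))
```

The old agent directory is left in place here; only the `latest` link is
removed, so the checkpointed state stays available for debugging.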

Any thoughts?

BTW this speaks to the need for MESOS-1739.

Yan

On Tue, Nov 15, 2016 at 7:37 AM, Megha Sharma <msharma3@apple.com> wrote:

> Hi All,
>
> We have been working on the design for Restartable tasks (
> MESOS-3545), and allowing agents to recover and re-register post reboot is
> a prerequisite for that.
> Today the agent doesn’t recover its state, which includes its SlaveID,
> post a host reboot; it short-circuits the recovery upon discovering the
> reboot and registers with the master as a new agent. With Partition
> Awareness, the Mesos master even allows agents which have failed the
> master’s health check pings (unreachable agents) to re-register with it
> and reconcile the tasks/executors. The executors on a rebooted host are
> terminated anyway, so there is no harm in letting such an agent recover
> and re-register with the master using its old SlaveID.
> We would like to hear from the folks here whether you see any operational
> concerns with letting the agents recover post a host reboot.
>
> MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223
>
> Many Thanks
> Megha Sharma
>
>
>
