mesos-issues mailing list archives

From "Joseph Wu (JIRA)" <>
Subject [jira] [Commented] (MESOS-4306) AGENT_DEAD Message
Date Fri, 08 Jan 2016 01:51:39 GMT


Joseph Wu commented on MESOS-4306:

For random outages, the output of {{/maintenance/status}} won't change, since only the operator can
trigger changes to it.

When the framework checks the machine's status, the machine will either:
# Not show up, if it hasn't been scheduled for maintenance.
# Show up as {{DRAINING}}, if it has been scheduled for maintenance but not yet taken down by the operator.
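
The check described above could be sketched on the framework side roughly as follows. This is only an illustration, not code from Mesos: the helper name {{machine_mode}} is hypothetical, and the response layout (a {{draining_machines}} list whose entries carry an {{id}} with a {{hostname}}, plus a {{down_machines}} list of machine IDs) is an assumption about the JSON returned by {{/maintenance/status}} that should be verified against the Mesos version in use.

```python
def machine_mode(status, hostname):
    """Classify a machine from an already-parsed /maintenance/status response.

    `status` is a dict decoded from the endpoint's JSON body. The field names
    used here are assumptions about that layout, not taken from the comment.
    """
    # Scheduled for maintenance but not yet taken down by the operator.
    for entry in status.get("draining_machines", []):
        if entry.get("id", {}).get("hostname") == hostname:
            return "DRAINING"

    # Assumed: machines the operator has explicitly taken down are listed here.
    for machine in status.get("down_machines", []):
        if machine.get("hostname") == hostname:
            return "DOWN"

    # Not scheduled for maintenance. A random outage also lands here, because
    # only operator actions change the maintenance status.
    return None


# Example with a hand-written response; hostnames are made up.
example_status = {
    "draining_machines": [{"id": {"hostname": "agent-1"}}],
    "down_machines": [{"hostname": "agent-2"}],
}
print(machine_mode(example_status, "agent-1"))
print(machine_mode(example_status, "agent-3"))
```

A result of {{None}} therefore cannot distinguish a healthy agent from one lost to an unplanned outage, which is the limitation the comment points out.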

> AGENT_DEAD Message
> ------------------
>                 Key: MESOS-4306
>                 URL:
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Gabriel Hartmann
> Frameworks currently receive SLAVE_LOST messages when an Agent fails or is behind a network partition for some period of time.  However, frameworks and indeed Mesos cannot differentiate between an Agent being temporarily or permanently lost.
> It would be good to have a message indicating that an Agent is lost and won't be returning.  This would require human intervention, so an endpoint should be exposed to induce the sending of this message.
> This is particularly helpful for frameworks which are waiting for the return of persistent volumes.  In the case where an Agent hosts significant data (multi-terabyte), the framework may be willing to wait a significant amount of time before repairing its replication factor (for example).  Explicit human-provided information about the permanent state of Agents and therefore their resources would allow these kinds of frameworks to accelerate their recovery

