mesos-issues mailing list archives

From "Neil Conway (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-6078) Add a agent teardown endpoint
Date Tue, 25 Oct 2016 19:39:58 GMT

    [ https://issues.apache.org/jira/browse/MESOS-6078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606261#comment-15606261 ]

Neil Conway commented on MESOS-6078:
------------------------------------

FYI, we will likely address this as part of the in-progress work on supporting {{TASK_GONE}}
and {{TASK_GONE_BY_OPERATOR}}. Workflow:

* the framework opts in to the {{PARTITION_AWARE}} capability.
* if Mesos can _prove_ that the agent ID is gone (e.g., because the agent reboots, changes
its boot ID, and then an agent using the same {{work_dir}} registers and receives a new agent
ID), the framework will get {{TASK_GONE}} status updates for all tasks on the agent.
* if the operator has some out-of-band knowledge that the agent will never attempt to re-register
and all of its tasks are no longer running, we'll provide an operator HTTP endpoint (e.g.,
/agent/gone) that the operator can hit. When this happens, the framework will receive {{TASK_GONE_BY_OPERATOR}}
status updates for all tasks on the agent.
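
As a rough illustration of the framework side of this workflow, the sketch below shows how a scheduler might tell the two new states apart when a status update arrives. The state names come from the list above; the function name {{describe_gone_state}} and the use of plain strings for task states are assumptions made for illustration, not part of the Mesos API.

```python
# Hypothetical helper: explain why a task is considered gone, based only on
# the state name carried in a status update.

GONE_STATE_REASONS = {
    # Mesos itself proved the old agent ID can never come back.
    "TASK_GONE": "agent ID proven gone by Mesos",
    # An operator asserted the agent is gone via an HTTP endpoint.
    "TASK_GONE_BY_OPERATOR": "agent declared gone by an operator",
}


def describe_gone_state(task_state):
    """Return a short reason string for the two 'gone' states, else None."""
    return GONE_STATE_REASONS.get(task_state)


for state in ("TASK_GONE", "TASK_GONE_BY_OPERATOR", "TASK_RUNNING"):
    reason = describe_gone_state(state)
    if reason is not None:
        print(f"{state}: safe to relaunch ({reason})")
    else:
        print(f"{state}: not a 'gone' state, leave the task alone")
```

In both "gone" cases the framework can relaunch the task elsewhere without risking a duplicate; the only difference is who vouched for the agent being gone.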

In the meantime, the {{/machine/down}} endpoint might help here -- it shouldn't be subject
to the agent removal rate limit.
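
For reference, a minimal sketch of preparing a {{/machine/down}} call is shown below. It assumes the endpoint takes a POST with a JSON array of machine IDs ({{hostname}} and {{ip}}), as described in the Mesos maintenance documentation; the master address, hostname, and IP are placeholders, and the helper name {{build_machine_down_request}} is hypothetical. As far as I understand the maintenance workflow, the machine also has to be part of a maintenance schedule before this call succeeds, so treat this as one step of that workflow rather than a standalone fix.

```python
import json
import urllib.request


def build_machine_down_request(master_url, machines):
    """Build (but do not send) a POST request for the master's /machine/down.

    `machines` is a list of (hostname, ip) tuples.
    """
    body = json.dumps(
        [{"hostname": hostname, "ip": ip} for hostname, ip in machines]
    ).encode("utf-8")
    return urllib.request.Request(
        url=master_url.rstrip("/") + "/machine/down",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder master address and machine, for illustration only.
request = build_machine_down_request(
    "http://master.example.com:5050",
    [("agent-1.example.com", "10.0.0.1")],
)

# Sending is left to the caller, e.g.:
# urllib.request.urlopen(request)
```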

> Add a agent teardown endpoint
> -----------------------------
>
>                 Key: MESOS-6078
>                 URL: https://issues.apache.org/jira/browse/MESOS-6078
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 1.0.0, 1.0.1
>            Reporter: Cody Maloney
>            Assignee: Michael Park
>              Labels: mesosphere
>
> Currently, when a whole agent machine is unexpectedly terminated for good (e.g., AWS terminates the instance without warning), it goes through the Mesos slave removal rate limit before it's gone.
> If a couple of agents or a whole rack goes down in a cluster of thousands of agents, this can become a problem.
> If the agent can be shut down "cleanly", everything would get rescheduled, but once the agent is gone, there is currently no good way for an administrator to indicate that the node is gone, its tasks are lost, and they should be rescheduled, if appropriate, as soon as possible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
