mesos-issues mailing list archives

From "Benjamin Mahler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.
Date Thu, 03 Sep 2015 00:04:46 GMT

    [ https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728257#comment-14728257 ]

Benjamin Mahler commented on MESOS-1474:
----------------------------------------

What is a framework upgrade in this context, and why does it require tasks to be drained?

> Provide cluster maintenance primitives for operators.
> -----------------------------------------------------
>
>                 Key: MESOS-1474
>                 URL: https://issues.apache.org/jira/browse/MESOS-1474
>             Project: Mesos
>          Issue Type: Epic
>          Components: framework, master, slave
>            Reporter: Benjamin Mahler
>            Assignee: Artem Harutyunyan
>              Labels: mesosphere, twitter
>
> Sometimes operators need to perform maintenance on a Mesos cluster; we define maintenance
here as anything that requires the tasks to be drained on the slave(s). Most Mesos upgrades
can be done without affecting running tasks, but there are situations where maintenance is
task-affecting:
> * Host maintenance (e.g. hardware repair, kernel upgrades).
> * Non-recoverable slave upgrades (e.g. adjusting slave attributes).
> * etc
> In order to ensure operators don’t violate frameworks’ SLAs, schedulers need to be
aware of planned unavailability events.
> Maintenance awareness allows schedulers to avoid churn for long running tasks by placing
them on machines not undergoing maintenance. If all resources are planned for maintenance,
then the scheduler will prefer machines scheduled for maintenance least imminently.
> Maintenance awareness is also crucial when a scheduler uses [persistent disk|https://issues.apache.org/jira/browse/MESOS-1554]
resources, to ensure that the scheduler is aware of the expected duration of unavailability
for a persistent disk resource (e.g. with three 1TB replicas, there is no need to replicate
1TB over the network when only one of the three replicas is going to be unavailable for a
reboot lasting less than 1 hour).
> There are a few primitives of interest here:
> * Provide a way for operators to [fully shut down a slave|https://issues.apache.org/jira/browse/MESOS-1475]
(killing all tasks underneath it). Colloquially known as a "hard drain".
> * Provide a way for operators to mark specific slaves as scheduled for maintenance. This
will inform the scheduler about the scheduled unavailability of the resources.
> * Provide a way for frameworks to be notified when resources are requested to be relinquished.
This gives the framework a chance to proactively move a task before it may be forcibly killed by an
operator. It also allows the automation of operations like: "please drain these slaves within
1 hour."
> See the [design doc|https://docs.google.com/a/twitter.com/document/d/16k0lVwpSGVOyxPSyXKmGC-gbNmRlisNEe4p-fAUSojk/edit#]
for the latest details.
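
To illustrate the scheduler behavior described in the issue above, here is a minimal Python sketch. It is not Mesos API: `Machine`, `pick_machine`, `should_rereplicate`, and all of their parameters are hypothetical names chosen for illustration, and the one-hour tolerance is only the example figure from the description. The sketch assumes the scheduler somehow knows, per machine, how far away planned maintenance is.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Machine:
    """Hypothetical view of a machine as seen by a maintenance-aware scheduler."""
    hostname: str
    # Seconds until planned maintenance begins; None means nothing is scheduled.
    maintenance_starts_in: Optional[float] = None


def pick_machine(machines: Sequence[Machine]) -> Machine:
    """Prefer a machine with no planned maintenance.

    If every machine is scheduled for maintenance, fall back to the one
    whose maintenance is least imminent (starts furthest in the future).
    """
    if not machines:
        raise ValueError("no machines to choose from")
    unscheduled = [m for m in machines if m.maintenance_starts_in is None]
    if unscheduled:
        return unscheduled[0]
    return max(machines, key=lambda m: m.maintenance_starts_in)


def should_rereplicate(expected_outage_s: float,
                       tolerable_outage_s: float = 3600.0) -> bool:
    """Re-replicate a persistent-disk replica only if the expected outage
    is longer than what the framework is willing to wait out.

    The 3600-second default mirrors the "< 1 hour" reboot example; it is
    an assumption, not a value defined by Mesos.
    """
    return expected_outage_s > tolerable_outage_s


if __name__ == "__main__":
    fleet = [Machine("a", 3600.0), Machine("b", None), Machine("c", 60.0)]
    print(pick_machine(fleet).hostname)
    print(should_rereplicate(expected_outage_s=1800.0))
```

A real scheduler would also weigh resources, constraints, and task duration; the sketch isolates only the maintenance-related preference.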



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
