mesos-user mailing list archives

From David Greenberg <dsg123456...@gmail.com>
Subject Re: Cluster Maintenance
Date Thu, 29 Oct 2015 18:32:53 GMT
I'm happy to answer any questions about Satellite. We use it at Two Sigma
for automated and manual maintenance of our large Mesos clusters. With
Satellite, you can use the REST endpoint to begin draining agents, just
like the Mesos maintenance API. One difference is that, in Satellite, when
you mark an agent as down for maintenance, you must also include the
reason; that is useful in larger organizations, since anyone can see when
and why an agent was drained.
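
For comparison, here's a rough sketch of what scheduling a drain window
looks like against the stock Mesos maintenance API (Mesos 0.25+).
Satellite's own endpoint differs, and the master address, hostname, IP,
and timing below are made-up examples:

    # Sketch: schedule a one-hour maintenance window for one agent via
    # the stock Mesos maintenance API. Master address, hostname, IP,
    # and timing are made-up examples.
    import time
    import requests

    MASTER = "http://mesos-master.example.com:5050"

    now_ns = int(time.time() * 1e9)
    schedule = {
        "windows": [{
            "machine_ids": [{"hostname": "agent1.example.com",
                             "ip": "10.0.0.1"}],
            "unavailability": {
                "start": {"nanoseconds": now_ns},
                "duration": {"nanoseconds": int(3600 * 1e9)},  # one hour
            },
        }]
    }

    # Once scheduled, frameworks with tasks on those machines begin
    # receiving inverse offers for the window.
    resp = requests.post(MASTER + "/master/maintenance/schedule",
                         json=schedule)
    resp.raise_for_status()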

Also, Satellite can automatically drain agents that fail arbitrary health
checks, and it generates alerts when it decides to do so. The neat thing
about Satellite is that the automatic and manual maintenance are
thoughtfully integrated, based on our experience running Mesos clusters
for more than a year. This way, you get the best of planned and automated
maintenance, with flexible alerting.
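
To make the automatic side concrete, the idea is roughly the loop below.
This is only an illustration of the pattern, not Satellite's actual
implementation; the health check, drain call, and alert hook are all
placeholders:

    # Illustration of the automatic-drain pattern only, not Satellite's
    # actual implementation. The health check, drain, and alert below
    # are placeholders.
    import time
    import requests

    AGENTS = ["agent1.example.com", "agent2.example.com"]  # examples

    def healthy(agent):
        # Placeholder: any arbitrary check (disk, load, a custom script).
        try:
            return requests.get("http://%s:5051/health" % agent,
                                timeout=5).ok
        except requests.RequestException:
            return False

    def drain(agent, reason):
        # Placeholder: with stock Mesos you would schedule a
        # maintenance window here (see the earlier sketch).
        print("draining %s: %s" % (agent, reason))

    def alert(agent, reason):
        # Placeholder: page/email/chat notification.
        print("ALERT: %s drained automatically (%s)" % (agent, reason))

    while True:
        for agent in AGENTS:
            if not healthy(agent):
                drain(agent, "failed health check")
                alert(agent, "failed health check")
        time.sleep(60)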

On Thu, Oct 29, 2015 at 11:24 AM Radoslaw Gruchalski <radek@gruchalski.com>
wrote:

> I've heard of this: https://github.com/twosigma/satellite
> Never used it though.
>
> On Thu, Oct 29, 2015 at 11:20 AM -0700, "John Omernik" <john@omernik.com>
> wrote:
>
>> I am wondering if there are some easy ways to take a healthy slave/agent
>> and start bleeding tasks off of it.
>>
>> Basically, without requiring every framework to explicitly support it,
>> I'd like the option to:
>>
>> 1. Stop offering the agent's resources to frameworks, i.e. no new
>> resources would be offered, but existing jobs/tasks would continue to run.
>> 2. Offer the ability, especially in the UI but potentially in the API as
>> well, to "kill" a task. This would cause a failure that forces the
>> framework to respond. For example, if it were a Docker container running
>> in Marathon and I said "please kill this task," Marathon would recognize
>> the failure and try to restart the container. Since our agent (from point
>> 1) is not offering resources, the restarted task would not land on the
>> agent in question.
>>
>> The reason for this manual bleeding is to, say, run updates on a node or
>> pull it out of service for other reasons (memory upgrades, etc.), and to
>> do so in a controlled way. You may want to deal with what's running on
>> the node by hand; a wholesale "kill everything," while it SHOULD be
>> doable, may not always be feasible. In addition, the inverse-offers
>> feature seems neat, but frameworks have to support it.
>>
>> So, is there anything like that now that I am just missing in the
>> documentation? I am curious to hear how others handle this situation in
>> their environments.
>>
>> John
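
As for John's point 2 above: if the task is running under Marathon, its
REST API can already do the manual kill, and Marathon relaunches the task
to satisfy the app's instance count. A rough sketch follows; the Marathon
address, app ID, and task ID are made-up examples, so check the docs for
your Marathon version:

    # Rough sketch: kill one Marathon task so Marathon relaunches it
    # elsewhere. Marathon address, app ID, and task ID are made-up
    # examples.
    import requests

    MARATHON = "http://marathon.example.com:8080"
    app_id = "my-app"          # hypothetical app
    task_id = "my-app.4a5b6c"  # hypothetical task instance

    # Without scale=true, Marathon kills the task and then restarts it
    # to meet the instance count; if the agent is no longer offering
    # resources (point 1), the replacement lands elsewhere.
    resp = requests.delete("%s/v2/apps/%s/tasks/%s"
                           % (MARATHON, app_id, task_id))
    resp.raise_for_status()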
