Mailing-List: contact user-help@mesos.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@mesos.apache.org
MIME-Version: 1.0
References: 
 <CAKOFcwppvh8R5SjdpE_nML-uHk12Af_NcV7ykz9P-5it2FfxDA@mail.gmail.com>
 <CC9DA76160E0602C.1DB66BCE-9421-4B50-B1AD-7E34ED811937@mail.outlook.com>
In-Reply-To: 
 <CC9DA76160E0602C.1DB66BCE-9421-4B50-B1AD-7E34ED811937@mail.outlook.com>
From: David Greenberg <dsg123456789@gmail.com>
Date: Thu, 29 Oct 2015 18:32:53 +0000
Message-ID: 
 <CAA6W+3QWSk-5i8p8YLckTiNT=7cOexqc1oOt8nzZ060_=Z6CQw@mail.gmail.com>
Subject: Re: Cluster Maintanence
To: user@mesos.apache.org
Content-Type: multipart/alternative; boundary=001a1145784e9597b30523428ad1

--001a1145784e9597b30523428ad1
Content-Type: text/plain; charset=UTF-8

I'm happy to answer any questions about Satellite--we use it at Two Sigma
for automated and manual maintenance of our huge Mesos clusters. With
Satellite, you can use the REST endpoint to begin draining agents, just
like the Mesos maintenance API. One difference is that, in Satellite, if
you mark an agent as being down for maintenance, you must also include the
reason, which is useful in larger organizations, since anyone can see when
and why an agent was drained.
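
To make the "drain with a reason" idea concrete, here is a minimal sketch
in Python. It is not Satellite's API. The schedule layout follows my
understanding of the Mesos maintenance documentation (a list of windows,
each with machine IDs and an unavailability interval in nanoseconds); the
function names and the wrapper that pairs a schedule with a mandatory
reason are hypothetical, made up for illustration.

```python
import json

NANOS_PER_SECOND = 1_000_000_000


def build_schedule(machines, start_s, duration_s):
    """Return a dict shaped like a Mesos maintenance schedule.

    `machines` is a list of (hostname, ip) pairs; times are in seconds.
    """
    machine_ids = [{"hostname": h, "ip": ip} for h, ip in machines]
    return {
        "windows": [
            {
                "machine_ids": machine_ids,
                "unavailability": {
                    "start": {"nanoseconds": start_s * NANOS_PER_SECOND},
                    "duration": {
                        "nanoseconds": duration_s * NANOS_PER_SECOND
                    },
                },
            }
        ]
    }


def drain_record(machines, start_s, duration_s, reason):
    """Hypothetical wrapper: refuse to drain without a stated reason."""
    if not reason or not reason.strip():
        raise ValueError("a reason is required to drain an agent")
    return {
        "reason": reason.strip(),
        "schedule": build_schedule(machines, start_s, duration_s),
    }


if __name__ == "__main__":
    record = drain_record(
        [("agent1.example.com", "10.0.0.1")],  # placeholder host
        1446143573,
        3600,
        "kernel update",
    )
    print(json.dumps(record, indent=2))
```

Keeping the reason next to the schedule means whoever looks at a drained
agent later can see why it was taken out, not just when.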

Also, Satellite can automatically drain agents that fail arbitrary health
checks, and generate alerts when it decides to do this. The neat thing with
Satellite is that the automatic and manual maintenance are thoughtfully
integrated based on our experiences running Mesos clusters for more than a
year. This way, you can have the best of planned and automated maintenance
with flexible alerting.
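
As a rough illustration of how automatic and manual draining can share one
decision path, here is a small Python sketch. This is not Satellite's
code; every name in it (Decision, evaluate, sweep, the check names) is an
assumption invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Decision:
    host: str
    drain: bool
    reason: str


def evaluate(host: str, checks: Dict[str, Callable[[str], bool]]) -> Decision:
    """Run every health check; drain if any of them fails."""
    failed = [name for name, check in checks.items() if not check(host)]
    if failed:
        reason = "failed checks: " + ", ".join(sorted(failed))
        return Decision(host, True, reason)
    return Decision(host, False, "")


def sweep(
    hosts: List[str],
    checks: Dict[str, Callable[[str], bool]],
    manual_drains: Dict[str, str],
    alert: Callable[[str], None],
) -> List[Decision]:
    """Manual drains take precedence; automatic drains raise an alert."""
    decisions = []
    for host in hosts:
        if host in manual_drains:
            # An operator already drained this host and gave a reason.
            decisions.append(Decision(host, True, manual_drains[host]))
            continue
        decision = evaluate(host, checks)
        if decision.drain:
            alert(f"draining {host}: {decision.reason}")
        decisions.append(decision)
    return decisions
```

In this sketch a manually drained host is never re-evaluated, so an
operator's stated reason is not overwritten by an automatic one.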

On Thu, Oct 29, 2015 at 11:24 AM Radoslaw Gruchalski <radek@gruchalski.com>
wrote:

> I've heard of this: https://github.com/twosigma/satellite
> Never used it though.
>
> On Thu, Oct 29, 2015 at 11:20 AM -0700, "John Omernik" <john@omernik.com>
> wrote:
>
>> I am wondering if there are some easy ways to take a healthy slave/agent
>> and start a process to bleed processes out.
>>
>> Basically, without having to do something where every framework would
>> support it, I'd like the option to
>>
>> 1. Stop offering resources to new frameworks. I.e. no new resources would
>> be offered, but existing jobs/tasks continue to run.
>> 2. Offer the ability, especially in the UI, but potentially in the API as
>> well, to "kill" a task. This would cause a failure that forces the
>> framework to respond. For example, if it were a Docker container running
>> in Marathon and I said "please kill this task," the task would be killed,
>> and Marathon would recognize the failure and try to restart the
>> container. Since our agent (in point 1) is not offering resources, that
>> task would not land on the agent in question.
>>
>>
>> The reason for this manual bleeding is to, say, run updates on a node or
>> pull it out of service for other reasons (memory upgrades, etc.) and do so
>> in a manual way. You may want to address what's running on the node
>> manually, so a wholesale "kill everything", while it SHOULD be doable, may
>> not always be feasible. In addition, the inverse offers feature seems
>> neat, but frameworks have to support it.
>>
>> So, is there anything like that now that I am just missing in the
>> documentation? I am curious to hear how others are handling this situation
>> in their environments.
>>
>> John

--001a1145784e9597b30523428ad1--