From: David Greenberg
Date: Thu, 29 Oct 2015 18:32:53 +0000
Subject: Re: Cluster Maintanence
To: user@mesos.apache.org
Reply-To: user@mesos.apache.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1145784e9597b30523428ad1

--001a1145784e9597b30523428ad1
Content-Type: text/plain; charset=UTF-8

I'm happy to answer any questions about Satellite--we use it at Two Sigma for automated and manual maintenance of our huge Mesos clusters. With Satellite, you can use the REST endpoint to begin draining agents, just like the Mesos maintenance API. One difference is that, in Satellite, if you mark an agent as being down for maintenance, you must also include the reason, which is useful in larger organizations, since anyone can see when and why an agent was drained.

Also, Satellite can automatically drain agents that fail arbitrary health checks, and generate alerts when it decides to do this. The neat thing with Satellite is that the automatic and manual maintenance are thoughtfully integrated based on our experiences running Mesos clusters for more than a year. This way, you can have the best of planned and automated maintenance with flexible alerting.

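As a rough illustration of the Mesos side of that comparison, the sketch below builds the JSON body that the Mesos maintenance primitives expect for a single maintenance window. It only constructs the payload and does not contact a master. The function name `maintenance_schedule`, the hostname, the IP address and the times are all made up for the example, and nothing here describes Satellite's own request format.

```python
import json
from datetime import datetime, timedelta, timezone

NANOS_PER_SECOND = 1_000_000_000


def maintenance_schedule(machines, start, duration):
    """Build a Mesos maintenance schedule body with one window.

    machines: iterable of (hostname, ip) pairs (hypothetical values below).
    start:    timezone-aware datetime at which the agents become unavailable.
    duration: timedelta giving the length of the window.
    Times are truncated to whole seconds before conversion to nanoseconds.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    start_ns = ((start - epoch) // timedelta(seconds=1)) * NANOS_PER_SECOND
    duration_ns = (duration // timedelta(seconds=1)) * NANOS_PER_SECOND
    return {
        "windows": [
            {
                "machine_ids": [
                    {"hostname": host, "ip": ip} for host, ip in machines
                ],
                "unavailability": {
                    "start": {"nanoseconds": start_ns},
                    "duration": {"nanoseconds": duration_ns},
                },
            }
        ]
    }


if __name__ == "__main__":
    # Example values only; substitute the agents you actually want to drain.
    body = maintenance_schedule(
        machines=[("agent1.example.com", "10.0.0.11")],
        start=datetime(2015, 10, 30, 2, 0, tzinfo=timezone.utc),
        duration=timedelta(hours=1),
    )
    print(json.dumps(body, indent=2))
```

With the Mesos maintenance primitives, a body like this is POSTed to the master's maintenance schedule endpoint, and the same machine list is later POSTed to the machine-down endpoint once the agents are drained. The exact paths should be checked against the maintenance documentation for the Mesos version in use.
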
On Thu, Oct 29, 2015 at 11:24 AM Radoslaw Gruchalski wrote:

> I've heard of this: https://github.com/twosigma/satellite
> Never used it though.
>
> On Thu, Oct 29, 2015 at 11:20 AM -0700, "John Omernik" wrote:
>
>> I am wondering if there are some easy ways to take a healthy slave/agent
>> and start a process to bleed processes out.
>>
>> Basically, without having to do something where every framework would
>> support it, I'd like the option to
>>
>> 1. Stop offering resources to new frameworks. I.e. no new resources would
>> be offered, but existing jobs/tasks continue to run.
>> 2. Offer the ability, especially in the UI, but potentially in the API as
>> well, to "kill" a task. This would cause a failure that forces the
>> framework to respond. For example, if it was a Docker container running
>> in Marathon and I said "please kill this task", Marathon would recognize
>> the failure and try to restart the container. Since our agent (in point 1)
>> is not offering resources, that task would not land on the agent in
>> question.
>>
>> The reason for this manual bleeding is to, say, run updates on a node or
>> pull it out of service for other reasons (memory upgrades etc.) and do so
>> in a manual way. You may want to address what's running on the node
>> manually, so a wholesale "kill everything", while it SHOULD be doable,
>> may not always be feasible. In addition, the inverse offers thing seems
>> neat, but frameworks have to support it.
>>
>> So, is there anything like that now and I am just missing it in the
>> documentation? I am curious to hear how others are handling this situation
>> in their environments.
>>
>> John

--001a1145784e9597b30523428ad1
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a1145784e9597b30523428ad1--