From: Zameer Manji
Date: Mon, 22 Feb 2016 12:52:36 -0800
Subject: Re: Safe update of agent attributes
To: user@mesos.apache.org

Zhitao,

In my experience the best way to manage these attributes is to ensure
attribute changes are minimal (i.e., one attribute at a time) and roll them
out slowly across a cluster. This way you can catch unsafe mutations quickly
and roll back if needed. I don't think there is a whitelist/blacklist of
attributes to reference, so I think this is the safest way to go.
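
The approach above (change one attribute at a time, roll out slowly) could be checked by tooling before a change reaches any agent. What follows is a minimal Python sketch under stated assumptions, not something Mesos provides: the helper names `parse_attributes`, `changed_attributes`, `is_minimal_change`, and `rollout_waves` are hypothetical, and the `name:value;name:value` string form is assumed to match what your configuration management passes to the agent's `--attributes` flag. It does not restart agents or detect crash loops; it only flags change sets that touch more than one attribute and splits hosts into waves that start small.

```python
from typing import Dict, List


def parse_attributes(flag: str) -> Dict[str, str]:
    """Parse a 'name:value;name:value' string into a dict.

    Assumption: this is the form your tooling passes to --attributes.
    """
    attrs: Dict[str, str] = {}
    for item in flag.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so values may contain colons.
        name, sep, value = item.partition(":")
        if not sep:
            raise ValueError(f"malformed attribute: {item!r}")
        attrs[name.strip()] = value.strip()
    return attrs


def changed_attributes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Return sorted names of attributes that were added, removed, or modified."""
    names = set(old) | set(new)
    return sorted(n for n in names if old.get(n) != new.get(n))


def is_minimal_change(old: Dict[str, str], new: Dict[str, str],
                      max_changes: int = 1) -> bool:
    """True if the change set touches at most max_changes attributes."""
    return len(changed_attributes(old, new)) <= max_changes


def rollout_waves(hosts: List[str], first_wave: int = 1,
                  growth: int = 2) -> List[List[str]]:
    """Split hosts into waves that start small and grow by a constant factor."""
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be >= 1")
    waves: List[List[str]] = []
    size, i = first_wave, 0
    while i < len(hosts):
        waves.append(hosts[i:i + size])
        i += size
        size *= growth
    return waves


if __name__ == "__main__":
    old = parse_attributes("rack:r1;kernel:3.13")
    new = parse_attributes("rack:r1;kernel:4.4")
    print("changed:", changed_attributes(old, new))
    print("minimal:", is_minimal_change(old, new))
    for n, wave in enumerate(rollout_waves([f"agent{i}" for i in range(8)])):
        print("wave", n, wave)
```

Pausing between waves and confirming that the agents in the previous wave came back cleanly leaves room to roll back before a bad attribute reaches the whole cluster.
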

On Mon, Feb 22, 2016 at 12:11 PM, Zhitao Li <zhitaoli.cs@gmail.com> wrote:
> Hi,
>
> We recently discovered that updating attributes on Mesos agents is a very
> risky operation, and has the potential to send agent(s) into a crash loop
> if not done properly, with errors like "Failed to perform recovery:
> Incompatible slave info detected". This, combined with --recovery_timeout,
> made the situation even worse.
>
> In our setup, some of the attributes are generated by an automated
> configuration management system, so this opens the possibility that "bad"
> configuration could be left on the machine and cause big trouble on the
> next agent upgrade if the USR1 signal was not sent in time.
>
> Some questions:
>
> 1. Does anyone have a recommended good practice for managing these
> attributes safely?
> 2. Has Mesos considered falling back to old metadata if it detects
> incompatibility, so agents would keep running with old attributes instead
> of falling into a crash loop?
>
> Thanks.
>
> --
> Cheers,
>
> Zhitao Li

--
Zameer Manji
