Return-Path: X-Original-To: apmail-mesos-dev-archive@www.apache.org Delivered-To: apmail-mesos-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D58F617793 for ; Wed, 4 Feb 2015 20:14:21 +0000 (UTC) Received: (qmail 75983 invoked by uid 500); 4 Feb 2015 20:14:22 -0000 Delivered-To: apmail-mesos-dev-archive@mesos.apache.org Received: (qmail 75915 invoked by uid 500); 4 Feb 2015 20:14:22 -0000 Mailing-List: contact dev-help@mesos.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@mesos.apache.org Delivered-To: mailing list dev@mesos.apache.org Received: (qmail 75900 invoked by uid 99); 4 Feb 2015 20:14:22 -0000 Received: from reviews-vm.apache.org (HELO reviews.apache.org) (140.211.11.40) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 04 Feb 2015 20:14:22 +0000 Received: from reviews.apache.org (localhost [127.0.0.1]) by reviews.apache.org (Postfix) with ESMTP id E22361CC407; Wed, 4 Feb 2015 20:14:17 +0000 (UTC) Content-Type: multipart/alternative; boundary="===============4592622384292755559==" MIME-Version: 1.0 Subject: Re: Review Request 30514: Rate limited the removal of slaves failing health checks. From: "Jie Yu" To: "David Robinson" , "Ben Mahler" , "Jie Yu" Cc: "Vinod Kone" , "mesos" Date: Wed, 04 Feb 2015 20:14:17 -0000 Message-ID: <20150204201417.1285.62462@reviews.apache.org> X-ReviewBoard-URL: https://reviews.apache.org/ Auto-Submitted: auto-generated Sender: "Jie Yu" X-ReviewGroup: mesos X-ReviewRequest-URL: https://reviews.apache.org/r/30514/ X-Sender: "Jie Yu" References: <20150204200111.1286.26093@reviews.apache.org> In-Reply-To: <20150204200111.1286.26093@reviews.apache.org> Reply-To: "Jie Yu" X-ReviewRequest-Repository: mesos --===============4592622384292755559== MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit > On Feb. 4, 2015, 8:01 p.m., Ben Mahler wrote: > > src/master/master.cpp, lines 197-246 > > > > > > It looks like we only log when we're shutting down the slave, when there's no rate limiter? > > > > Have you considered having a single shutdown code path? This gets you a single dispatch, and a single log statement and it looks like it's less control flow? > > > > ``` > > void shutdown() > > { > > if (shuttingDown.isNone()) { > > Future future = Nothing(); > > > > if (limiter.isSome()) { > > LOG(INFO) << "Scheduling shutdown of slave " << slaveId > > << " due to health check timeout"; > > > > future = limiter.get()->acquire(); > > } > > > > shuttingDown = future.onAny(defer(self(), &Self::_shutdown)); > > > > ++metrics->slave_shutdowns_scheduled; > > } > > } > > > > void _shutdown() > > { > > CHECK_SOME(shuttingDown); > > > > const Future& future = shuttingDown.get(); > > > > CHECK(!future.isFailed()) << future.failure(); > > > > if (future.isReady()) { > > LOG(INFO) << "Shutting down slave " << slaveId > > << " due to health check timeout"; > > > > dispatch(master, > > &Master::shutdownSlave, > > slaveId, > > "health check timed out"); > > } else if (future.isDiscarded()) { > > // Shutdown was canceled. > > LOG(INFO) << "Canceling shutdown of slave " << slaveId > > << " since a pong is received!"; > > > > ++metrics->slave_shutdowns_canceled; > > } > > > > shuttingDown = None(); > > } > > ``` > > Jie Yu wrote: > Having both 'future' and 'shuttingDown' in 'shutdown' method looks confusing to me:) You need to comment about why you need that future. Or, I think having dup code is not a too bad thing in this case as the logic is more clear IMO. Another option to avoid dup code is: 1) s/shutdown/scheduleShutdown 2) let 'shutdown' be the actual shutdown (sending message) - Jie ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30514/#review71021 ----------------------------------------------------------- On Feb. 4, 2015, 1:51 a.m., Vinod Kone wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/30514/ > ----------------------------------------------------------- > > (Updated Feb. 4, 2015, 1:51 a.m.) > > > Review request for mesos, Ben Mahler, David Robinson, and Jie Yu. > > > Bugs: MESOS-1148 > https://issues.apache.org/jira/browse/MESOS-1148 > > > Repository: mesos > > > Description > ------- > > The algorithm is simple. All the slave observers share a rate limiter whose rate is configured via command line. When a slave times out on health check, a permit is acquired to shutdown the slave *but* the pings are continued to be sent. If a pong arrives before the permit is satisifed, the shutdown is cancelled. > > > Diffs > ----- > > src/master/flags.hpp 6c18a1af625311ef149b5877b08f63c2b12c040d > src/master/master.hpp cd37ee9d3c57bcd91f08cd402ec1e4a09d9e42ee > src/master/master.cpp d04b2c4041d8fe8978b877f07579a6f907903e1b > src/tests/partition_tests.cpp fea78016268b007590516798eb30ff423fd0ae58 > src/tests/slave_tests.cpp e7e2af63da785644f3f7e6e23607c02be962a2c6 > > Diff: https://reviews.apache.org/r/30514/diff/ > > > Testing > ------- > > make check > > Ran the new tests 100 times > > > Thanks, > > Vinod Kone > > --===============4592622384292755559==--