From: Benjamin Mahler
Date: Tue, 29 Jul 2014 18:36:30 -0700
Subject: Re: git commit: Task health status change notifications
To: dev, Niklas Nielsen, Connor Doyle
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=089e0158b154b2b6a204ff5f3006

--089e0158b154b2b6a204ff5f3006
Content-Type: text/plain; charset=UTF-8

Not sure if it's related to this commit, but it seems the GracePeriod test is
flaky now:

https://issues.apache.org/jira/browse/MESOS-1653

Could you help triage this ticket?

On Tue, Jul 29, 2014 at 3:47 PM, wrote:

> Repository: mesos
> Updated Branches:
>   refs/heads/master 98557a7cf -> f66289831
>
>
> Task health status change notifications
>
> The reusable health check program added in #22579 emits TaskStatus
> messages when the task under supervision first becomes viable (when the
> task passes its first health check).  It also emits a message when a
> task changes state from healthy to unhealthy.
>
> However, the scheduler should be notified for _every_ observed change in
> health status.  It's easy to imagine cases where the scheduler wants to
> wait a while before killing an unhealthy task, but still be notified of
> status changes so that load balancers may be updated, etc.  This patch
> therefore causes the scheduler to also be notified when an unhealthy
> task becomes healthy again.
>
> Review: https://reviews.apache.org/r/23966
>
>
> Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
> Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/f6628983
> Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/f6628983
> Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/f6628983
>
> Branch: refs/heads/master
> Commit: f6628983165e4b0a2f44bb288ff87041f9e5e1bb
> Parents: 98557a7
> Author: Connor Doyle
> Authored: Tue Jul 29 15:46:45 2014 -0700
> Committer: Niklas Q. Nielsen
> Committed: Tue Jul 29 15:46:45 2014 -0700
>
> ----------------------------------------------------------------------
>  src/health-check/main.cpp        |  5 +-
>  src/tests/health_check_tests.cpp | 87 +++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+), 1 deletion(-)
> ----------------------------------------------------------------------
>
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/f6628983/src/health-check/main.cpp
> ----------------------------------------------------------------------
> diff --git a/src/health-check/main.cpp b/src/health-check/main.cpp
> index 95d881e..10d57a0 100644
> --- a/src/health-check/main.cpp
> +++ b/src/health-check/main.cpp
> @@ -121,7 +121,10 @@ private:
>    void success()
>    {
>      VLOG(1) << "Check passed";
> -    if (initializing) {
> +
> +    // Send a healthy status update on the first success,
> +    // and on the first success following failure(s).
> +    if (initializing || consecutiveFailures > 0) {
>        TaskHealthStatus taskHealthStatus;
>        taskHealthStatus.set_healthy(true);
>        taskHealthStatus.mutable_task_id()->CopyFrom(taskID);
>
>
> http://git-wip-us.apache.org/repos/asf/mesos/blob/f6628983/src/tests/health_check_tests.cpp
> ----------------------------------------------------------------------
> diff --git a/src/tests/health_check_tests.cpp b/src/tests/health_check_tests.cpp
> index aa5b78b..6c54ea8 100644
> --- a/src/tests/health_check_tests.cpp
> +++ b/src/tests/health_check_tests.cpp
> @@ -174,6 +174,93 @@ TEST_F(HealthCheckTest, HealthyTask)
>    Shutdown();
>  }
>
> +// Testing health status change reporting to scheduler.
> +TEST_F(HealthCheckTest, HealthStatusChange)
> +{
> +  Try<PID<Master> > master = StartMaster();
> +  ASSERT_SOME(master);
> +
> +  slave::Flags flags = CreateSlaveFlags();
> +  flags.isolation = "posix/cpu,posix/mem";
> +
> +  Try<MesosContainerizer*> containerizer =
> +    MesosContainerizer::create(flags, false);
> +  CHECK_SOME(containerizer);
> +
> +  Try<PID<Slave> > slave = StartSlave(containerizer.get());
> +  ASSERT_SOME(slave);
> +
> +  MockScheduler sched;
> +  MesosSchedulerDriver driver(
> +    &sched, DEFAULT_FRAMEWORK_INFO, master.get(), DEFAULT_CREDENTIAL);
> +
> +  EXPECT_CALL(sched, registered(&driver, _, _));
> +
> +  Future<vector<Offer> > offers;
> +  EXPECT_CALL(sched, resourceOffers(&driver, _))
> +    .WillOnce(FutureArg<1>(&offers))
> +    .WillRepeatedly(Return()); // Ignore subsequent offers.
> +
> +  driver.start();
> +
> +  AWAIT_READY(offers);
> +  EXPECT_NE(0u, offers.get().size());
> +
> +  // Create a temporary file.
> +  Try<string> temporaryPath = os::mktemp();
> +  ASSERT_SOME(temporaryPath);
> +  string tmpPath = temporaryPath.get();
> +
> +  // This command fails every other invocation.
> +  // For all runs i in Nat0, the following case i % 2 applies:
> +  //
> +  // Case 0:
> +  //   - Remove the temporary file.
> +  //
> +  // Case 1:
> +  //   - Attempt to remove the nonexistent temporary file.
> +  //   - Create the temporary file.
> +  //   - Exit with a non-zero status.
> +  string alt = "rm " + tmpPath + " || (touch " + tmpPath + " && exit 1)";
> +
> +  vector<TaskInfo> tasks = populateTasks(
> +    "sleep 20", alt, offers.get()[0], 0, 3);
> +
> +  Future<TaskStatus> statusRunning;
> +  Future<TaskStatus> statusHealth1;
> +  Future<TaskStatus> statusHealth2;
> +  Future<TaskStatus> statusHealth3;
> +
> +  EXPECT_CALL(sched, statusUpdate(&driver, _))
> +    .WillOnce(FutureArg<1>(&statusRunning))
> +    .WillOnce(FutureArg<1>(&statusHealth1))
> +    .WillOnce(FutureArg<1>(&statusHealth2))
> +    .WillOnce(FutureArg<1>(&statusHealth3));
> +
> +  driver.launchTasks(offers.get()[0].id(), tasks);
> +
> +  AWAIT_READY(statusRunning);
> +  EXPECT_EQ(TASK_RUNNING, statusRunning.get().state());
> +
> +  AWAIT_READY(statusHealth1);
> +  EXPECT_EQ(TASK_RUNNING, statusHealth1.get().state());
> +  EXPECT_TRUE(statusHealth1.get().healthy());
> +
> +  AWAIT_READY(statusHealth2);
> +  EXPECT_EQ(TASK_RUNNING, statusHealth2.get().state());
> +  EXPECT_FALSE(statusHealth2.get().healthy());
> +
> +  AWAIT_READY(statusHealth3);
> +  EXPECT_EQ(TASK_RUNNING, statusHealth3.get().state());
> +  EXPECT_TRUE(statusHealth3.get().healthy());
> +
> +  os::rm(tmpPath); // Clean up the temporary file.
> +
> +  driver.stop();
> +  driver.join();
> +
> +  Shutdown();
> +}
>
>  // Testing killing task after number of consecutive failures.
>  // Temporarily disabled due to MESOS-1613.
>
>

--089e0158b154b2b6a204ff5f3006--
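
For anyone triaging, the notification rule from the quoted patch can be
sketched on its own, outside of Mesos. This is only a rough standalone
illustration, not the code in src/health-check/main.cpp: HealthTracker,
Update, and replay are hypothetical names, and the idea that only the first
failure in a run produces an unhealthy update is an assumption of the sketch
(the quoted diff shows only the success() side).

```cpp
#include <vector>

// Hypothetical update kinds; the real checker sends TaskHealthStatus messages.
enum class Update { None, Healthy, Unhealthy };

// Hypothetical stand-in for the health checker's bookkeeping. The field
// names mirror those visible in the quoted diff, nothing more.
class HealthTracker
{
public:
  // A check passed: notify on the first success, and on the first
  // success following failure(s), as in the patched condition.
  Update success()
  {
    const bool notify = initializing || consecutiveFailures > 0;
    initializing = false;
    consecutiveFailures = 0;
    return notify ? Update::Healthy : Update::None;
  }

  // A check failed. Assumption of this sketch: only the first failure
  // in a run is reported as a status change.
  Update failure()
  {
    ++consecutiveFailures;
    return consecutiveFailures == 1 ? Update::Unhealthy : Update::None;
  }

private:
  bool initializing = true;
  int consecutiveFailures = 0;
};

// Replays a sequence of check results (true = passed) and collects
// the updates that would be sent, skipping the "no change" cases.
std::vector<Update> replay(const std::vector<bool>& results)
{
  HealthTracker tracker;
  std::vector<Update> sent;
  for (bool passed : results) {
    const Update update = passed ? tracker.success() : tracker.failure();
    if (update != Update::None) {
      sent.push_back(update);
    }
  }
  return sent;
}
```

Replaying pass, fail, pass (the sequence the alternating command in
HealthStatusChange is meant to produce) gives Healthy, Unhealthy, Healthy
under these assumptions. With the old `if (initializing)` condition the
final Healthy would never be sent.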