Date: Thu, 17 Dec 2015 01:01:41 +0800
Subject: Re: Speed up Mesos tests
From: tommy xiao
To: dev@mesos.apache.org

+1

2015-12-16 2:15 GMT+08:00 Alex Rukletsov:

> Folks,
>
> I would like to share some facts and thoughts about tests. When I ran
> `make check -j7` on my Mac OS machine the other day, gtest reported the
> following (your numbers may vary depending on the OS you're on and the
> filters you use):
> [==========] 882 tests from 117 test cases ran. (298610 ms total)
>
> The same command for Mesos 0.21.1, which was released about a year ago,
> yields
> [==========] 452 tests from 71 test cases ran. (196398 ms total)
>
> We almost doubled the number of tests in 2015. I think this is a great
> achievement per se; moreover, it makes the life of cluster operators,
> release managers, and Mesos contributors less stressful.
> I am going to have an extra glass of champagne to celebrate this on the
> upcoming New Year's Eve : ).
>
> There are still some flaky tests left (and there always will be, since
> failure is embedded in progress), but it is not the flakiness I would
> like to discuss today. I would like to draw your attention to the last
> number in the gtest output lines above.
>
> When adding tests, we also add to the time it takes for a complete
> test suite to run. There are several ways to keep this number small
> (one is, heh, to write fewer tests : ) ). Today I propose to focus on
> reducing the duration of individual test cases.
>
> Mesos tests are often built around certain sequences of events; some of
> those have timeouts, some depend on other events. Naive test
> implementations sometimes lead to a test being blocked for the duration
> of some timeout, pointlessly slowing down the whole suite! A good
> indicator of such a test is that its duration is an integral number of
> seconds (the timeout) plus some delta (the actual testing code), for
> example 3123 ms or 5076 ms.
>
> Suggestion: If you write a new test, please look at the test duration as
> well. If it seems unreasonably long, investigate what the reasons are and
> how you can make the test faster.
>
> State of the art:
>   * Slave recovery tests are known to be slow, see MESOS-733 [1].
>   * Ben Mahler created an epic to track slow tests more than a year ago
>     (MESOS-1757 [2]) and did some work earlier (MESOS-297 [3]).
>   * Dominic Hamon did pretty much what I have done (with a much nicer
>     command, too bad I noticed that after generating the list myself) and
>     filed MESOS-2059 [4].
>
> To get a list of suspect tests I ran `./bin/mesos-tests.sh 2>/dev/null |
> grep "ms)"` and noted down tests that took more than 1 second to complete.
> To my knowledge, 1s is the shortest timeout we use in default values for
> configurable parameters.
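The grep-and-note-down step can also be scripted. The following is only a rough sketch, not the command Alex or Dominic used: it assumes gtest's usual per-test result lines of the form `[       OK ] Name (123 ms)`, and the function name `slow_tests`, the sample lines, and `HypotheticalTest.Fast` are made up for illustration.

```python
import re

# Per-test result line as printed by gtest, e.g.
#   [       OK ] MasterTest.OfferTimeout (1053 ms)
# Suite summaries end in "ms total)" and therefore do not match.
RESULT_LINE = re.compile(r"^\[\s*(?:OK|FAILED)\s*\]\s+(\S+)\s+\((\d+) ms\)\s*$")


def slow_tests(lines, threshold_ms=1000):
    """Return (name, duration_ms) pairs above threshold_ms, slowest first."""
    found = []
    for line in lines:
        match = RESULT_LINE.match(line)
        if match and int(match.group(2)) > threshold_ms:
            found.append((match.group(1), int(match.group(2))))
    return sorted(found, key=lambda item: item[1], reverse=True)


# Sample input; the two slow entries reuse durations quoted in this thread,
# the fast one is a made-up entry for contrast.
sample = [
    "[ RUN      ] MasterTest.OfferTimeout",
    "[       OK ] MasterTest.OfferTimeout (1053 ms)",
    "[       OK ] HealthCheckTest.CheckCommandTimeout (15483 ms)",
    "[       OK ] HypotheticalTest.Fast (12 ms)",
    "[----------] 3 tests from Example (16548 ms total)",
]

for name, duration_ms in slow_tests(sample):
    print(f"{duration_ms:>6} ms  {name}")
```

Passing `sys.stdin` instead of `sample` should let the same function read the piped output of `./bin/mesos-tests.sh 2>/dev/null`. The 1000 ms default mirrors the 1 second cutoff mentioned above.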
>
> For each test from the list I either created a JIRA ticket or grouped a
> bunch of seemingly related tickets into an epic (details below). I
> hijacked MESOS-1757 [2] and made it a parent for all newly created epics
> and tickets.
>
> I would like to encourage folks to look at these tickets and work on them
> when they have the time and are in the mood. Apart from making `make
> check` faster, I believe that most of these tickets are actually a very
> good way to familiarize yourself with the Mesos codebase (hence I marked
> all tickets as `newbie++`), so if you would like to contribute to Mesos
> but do not know where to start, this can be a good choice!
>
> It is clear that some tickets are false positives and that there is a
> good reason why a particular test takes longer than others. In this case
> a comment explaining the reason is a proper resolution for the ticket.
>
> To avoid difficulties with finding a shepherd, I would suggest
> investigating the test first, understanding the reason for the slowness,
> and updating the ticket, so that a potential shepherd can more easily
> estimate the amount of time necessary for fixing the issue. Investigating
> does not require a shepherd, and once it is done, all following steps
> (finding a shepherd, submitting a patch, getting it committed) are
> trivial.
>
> I believe some tests may share the same root cause (for example, they
> rely on the same timeout, which cannot be changed from the test harness).
> In this case all such tests can be fixed by a single change.
>
> Below are the suspect tests.
>   * Examples tests, slow since early days, see MESOS-297 [3]. Filed
>     MESOS-4155 [6].
>   * Fetcher cache and fetcher cache http tests, filed MESOS-4156 [7].
>   * Zookeeper tests, some are slow since early days, see MESOS-297 [3].
>     Filed MESOS-4157 [8].
>   * Slave recovery tests. Known to be slow, see MESOS-733 [1] and
>     MESOS-297 [3]. Filed MESOS-4158 [9].
>   * Group tests, filed MESOS-4159 [10].
>   * Recover tests, filed MESOS-4160 [11].
>
>   * SlaveTest.CommandExecutorWithOverride (1311 ms), filed MESOS-4161 [12].
>   * SlaveTest.MetricsSlaveLaunchErrors (1009 ms), filed MESOS-4162 [13].
>   * SlaveTest.HTTPSchedulerSlaveRestart (2307 ms), filed MESOS-4163 [14].
>   * MasterTest.RecoverResources (1018 ms), filed MESOS-4164 [15].
>   * MasterTest.MasterInfoOnReElection (1024 ms), filed MESOS-4165 [16].
>   * MasterTest.LaunchCombinedOfferTest (2023 ms), filed MESOS-4166 [17].
>   * MasterTest.OfferTimeout (1053 ms), filed MESOS-4167 [18].
>   * MasterAllocatorTest/0.SlaveLost (5076 ms). Allocator-related test,
>     see MESOS-3775 [5]. The test waits 5s for an executor to terminate.
>   * MasterMaintenanceTest.EnterMaintenanceMode (5087 ms), filed
>     MESOS-4168 [19].
>   * MasterMaintenanceTest.InverseOffers (2027 ms), filed MESOS-4169 [20].
>   * OversubscriptionTest.UpdateAllocatorOnSchedulerFailover (1018 ms),
>     filed MESOS-4170 [21].
>   * OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover (1018 ms),
>     filed MESOS-4171 [22].
>   * GarbageCollectorIntegrationTest.Restart (5102 ms), filed MESOS-4172
>     [23].
>   * HealthCheckTest.CheckCommandTimeout (15483 ms), filed MESOS-4173 [24].
>   * HookTest.VerifySlaveLaunchExecutorHook (5061 ms), filed MESOS-4174
>     [25].
>   * ContentType/SchedulerTest.Decline/0 (1022 ms), filed MESOS-4175 [26].
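The "integral number of seconds plus some delta" indicator from the original post can be applied to the durations listed above. This is a small sketch under assumptions: the helper name `timeout_signature` and the 200 ms cutoff for what counts as a "small" delta are arbitrary choices made here, not something the thread defines.

```python
def timeout_signature(duration_ms, max_delta_ms=200):
    """Return (seconds, delta_ms, suspicious) for a test duration.

    suspicious is True when the duration is at least one whole second and
    the remainder is small, which hints that the test sat out a timeout.
    max_delta_ms is an arbitrary cutoff chosen for this sketch.
    """
    seconds, delta_ms = divmod(duration_ms, 1000)
    return seconds, delta_ms, seconds >= 1 and delta_ms <= max_delta_ms


# Durations taken from the list of suspect tests in this thread.
durations = {
    "MasterAllocatorTest/0.SlaveLost": 5076,
    "SlaveTest.CommandExecutorWithOverride": 1311,
    "HealthCheckTest.CheckCommandTimeout": 15483,
}

for name, duration_ms in durations.items():
    seconds, delta_ms, suspicious = timeout_signature(duration_ms)
    print(f"{name}: {seconds}s + {delta_ms}ms, timeout-bound? {suspicious}")
```

With this cutoff, 5076 ms splits into 5 s plus 76 ms and is flagged, which agrees with the note that MasterAllocatorTest/0.SlaveLost waits 5s for an executor to terminate. The other two are not flagged, so they would need a closer look rather than a quick classification.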
>
> Thanks for reading up to this point,
> AlexR
>
>
> [1] https://issues.apache.org/jira/browse/MESOS-733
> [2] https://issues.apache.org/jira/browse/MESOS-1757
> [3] https://issues.apache.org/jira/browse/MESOS-297
> [4] https://issues.apache.org/jira/browse/MESOS-2059
> [5] https://issues.apache.org/jira/browse/MESOS-3775
> [6] https://issues.apache.org/jira/browse/MESOS-4155
> [7] https://issues.apache.org/jira/browse/MESOS-4156
> [8] https://issues.apache.org/jira/browse/MESOS-4157
> [9] https://issues.apache.org/jira/browse/MESOS-4158
> [10] https://issues.apache.org/jira/browse/MESOS-4159
> [11] https://issues.apache.org/jira/browse/MESOS-4160
> [12] https://issues.apache.org/jira/browse/MESOS-4161
> [13] https://issues.apache.org/jira/browse/MESOS-4162
> [14] https://issues.apache.org/jira/browse/MESOS-4163
> [15] https://issues.apache.org/jira/browse/MESOS-4164
> [16] https://issues.apache.org/jira/browse/MESOS-4165
> [17] https://issues.apache.org/jira/browse/MESOS-4166
> [18] https://issues.apache.org/jira/browse/MESOS-4167
> [19] https://issues.apache.org/jira/browse/MESOS-4168
> [20] https://issues.apache.org/jira/browse/MESOS-4169
> [21] https://issues.apache.org/jira/browse/MESOS-4170
> [22] https://issues.apache.org/jira/browse/MESOS-4171
> [23] https://issues.apache.org/jira/browse/MESOS-4172
> [24] https://issues.apache.org/jira/browse/MESOS-4173
> [25] https://issues.apache.org/jira/browse/MESOS-4174
> [26] https://issues.apache.org/jira/browse/MESOS-4175
>

-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com