mesos-dev mailing list archives

From tommy xiao <xia...@gmail.com>
Subject Re: Speed up Mesos tests
Date Wed, 16 Dec 2015 17:01:41 GMT
+1

2015-12-16 2:15 GMT+08:00 Alex Rukletsov <alex@mesosphere.com>:

> Folks,
>
> I would like to share some facts and thoughts about tests. When I ran `make
> check -j7` on my Mac OS machine the other day, gtest reported the following
> (your numbers may vary depending on the OS you're on and filters you use):
> [==========] 882 tests from 117 test cases ran. (298610 ms total)
>
> The same command for Mesos 0.21.1, which was released around a year ago,
> yields
> [==========] 452 tests from 71 test cases ran. (196398 ms total)
>
> We almost doubled the number of tests in 2015. I think this is a great
> achievement per se; moreover, it makes the life of cluster operators,
> release managers, and Mesos contributors less stressful. I am going to have
> an extra glass of champagne to celebrate this on the upcoming New Year's Eve
> : ).
>
> There are still some flaky tests left (and there always will be; failure
> is embedded into progress), but it is not the flakiness I would like to
> discuss today. I would like to draw your attention to the last number in
> the gtest output lines above.
>
> When adding tests, we also contribute to the time it takes for a complete
> test suite to run. There are multiple ways to keep this number
> small (one is, heh, to write fewer tests : ) ). Today I propose to focus on
> reducing the duration of individual test cases.
>
> Mesos tests are often built around certain sequences of events; some of
> those have timeouts, some depend on other events. Naive test
> implementations sometimes lead to a test being blocked for the duration of
> some timeout, pointlessly slowing down the whole suite! A good indicator of
> such a test is that its duration is an integral number of seconds (the
> timeout) plus some delta (actual testing code), for example 3123 ms, 5076
> ms.
>
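The indicator described above can be sketched as a small check. This is only a rough heuristic, not anything from the Mesos tree: the function name `looks_timeout_bound` and the 150 ms cutoff for the "delta" are assumptions chosen for illustration.

```python
def looks_timeout_bound(duration_ms, max_delta_ms=150):
    """Return True if a duration is a whole number of seconds plus a small delta.

    max_delta_ms is an assumed cutoff for how much time the "actual testing
    code" may take on top of the timeout; tune it for your machine.
    """
    seconds, delta = divmod(duration_ms, 1000)
    return seconds >= 1 and delta <= max_delta_ms


# Durations taken from this thread.
print(looks_timeout_bound(3123))  # True: 3 s + 123 ms
print(looks_timeout_bound(5076))  # True: 5 s + 76 ms
print(looks_timeout_bound(1311))  # False: the 311 ms delta exceeds the cutoff
```

A positive result is only a hint that a test may be waiting out a timeout; the test still has to be read to confirm it.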
> Suggestion: If you write a new test, please look at the test duration as
> well. If it seems unreasonably long, investigate what the reasons are and
> how you can make the test faster.
>
> State of the art:
>   * Slave recovery tests are known to be slow, see MESOS-733 [1].
>   * Ben Mahler created an epic to track slow tests more than a year ago
> (MESOS-1757 [2]) and did some work earlier (MESOS-297 [3]).
>   * Dominic Hamon did pretty much what I have done (with a much nicer
> command, too bad I noticed that after generating the list myself) and filed
> MESOS-2059 [4].
>
> To get a list of suspect tests I ran `./bin/mesos-tests.sh 2>/dev/null |
> grep "ms)"` and noted down tests that took more than 1 second to complete.
> To my knowledge, 1s is the shortest timeout we use in default values for
> configurable parameters.
>
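The filtering step above (keep the per-test lines, then note those over 1 second) could also be scripted. Below is a minimal sketch, assuming the usual gtest per-test result format `[       OK ] Suite.Test (N ms)`; the helper name `slow_tests` and the sample test `ExampleTest.Fast` are hypothetical.

```python
import re

# Matches gtest per-test result lines, for example:
#   [       OK ] SlaveTest.MetricsSlaveLaunchErrors (1009 ms)
# Summary lines ending in "ms total)" are deliberately not matched.
RESULT_LINE = re.compile(r"^\[\s+(OK|FAILED)\s+\]\s+(\S+) \((\d+) ms\)$")


def slow_tests(lines, threshold_ms=1000):
    """Return (name, duration_ms) pairs slower than threshold_ms, slowest first."""
    found = []
    for line in lines:
        match = RESULT_LINE.match(line.strip())
        if match and int(match.group(3)) > threshold_ms:
            found.append((match.group(2), int(match.group(3))))
    return sorted(found, key=lambda item: item[1], reverse=True)


sample = [
    "[ RUN      ] SlaveTest.MetricsSlaveLaunchErrors",
    "[       OK ] SlaveTest.MetricsSlaveLaunchErrors (1009 ms)",
    "[       OK ] HealthCheckTest.CheckCommandTimeout (15483 ms)",
    "[       OK ] ExampleTest.Fast (12 ms)",  # hypothetical fast test
    "[==========] 882 tests from 117 test cases ran. (298610 ms total)",
]
for name, duration in slow_tests(sample):
    print(f"{duration:>6} ms  {name}")
```

To use it on a real run, the output of `./bin/mesos-tests.sh 2>/dev/null` can be piped into a script that passes `sys.stdin` to `slow_tests`.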
> For each test from the list I either created a JIRA ticket, or grouped a
> bunch of seemingly related tickets into an epic (details below). I hijacked
> MESOS-1757 [2] and made it a parent for all newly created epics and
> tickets.
>
> I would like to encourage folks to look at these tickets and work on them
> when they have time and mood. Apart from making `make check` faster, I believe
> that most of these tickets are actually a very good way to familiarize
> yourself with the Mesos codebase (hence I marked all tickets as
> `newbie++`), so if you would like to contribute to Mesos but do not know
> where to start — this can be a good choice!
>
> It is clear that some tickets are false positives and there exists a good
> reason why this particular test takes longer than others. In this case a
> comment explaining this reason is a proper resolution for the ticket.
>
> To avoid difficulties with finding a shepherd, I would suggest
> investigating the test first, understanding the reason for the slowness,
> and updating the ticket, so that a potential shepherd can more easily estimate
> the amount of time necessary for fixing the issue. Investigating does not
> require a shepherd, and once it is done, all following steps (finding a
> shepherd, submitting a patch, getting it committed) are trivial.
>
> I believe some tests may share the same root cause (for example, they rely
> on the same timeout, which cannot be changed from the test harness). In
> this case all such tests can be fixed by a single change.
>
> Below are the suspect tests.
>   * Examples tests, slow since early days, see MESOS-297 [3]. Filed
> MESOS-4155 [6].
>   * Fetcher cache and fetcher cache http tests, filed MESOS-4156 [7].
>   * Zookeeper tests, some are slow since early days, see MESOS-297 [3].
> Filed MESOS-4157 [8].
>   * Slave recovery tests. Known to be slow, see MESOS-733 [1] and MESOS-297
> [3]. Filed MESOS-4158 [9].
>   * Group tests, filed MESOS-4159 [10].
>   * Recover tests, filed MESOS-4160 [11].
>
>   * SlaveTest.CommandExecutorWithOverride (1311 ms), filed MESOS-4161 [12].
>   * SlaveTest.MetricsSlaveLaunchErrors (1009 ms), filed MESOS-4162 [13].
>   * SlaveTest.HTTPSchedulerSlaveRestart (2307 ms), filed MESOS-4163 [14].
>   * MasterTest.RecoverResources (1018 ms), filed MESOS-4164 [15].
>   * MasterTest.MasterInfoOnReElection (1024 ms), filed MESOS-4165 [16].
>   * MasterTest.LaunchCombinedOfferTest (2023 ms), filed MESOS-4166 [17].
>   * MasterTest.OfferTimeout (1053 ms), filed MESOS-4167 [18].
>   * MasterAllocatorTest/0.SlaveLost (5076 ms). Allocator related test,
> MESOS-3775 [5]. The test waits 5s for an executor to terminate.
>   * MasterMaintenanceTest.EnterMaintenanceMode (5087 ms), filed MESOS-4168
> [19].
>   * MasterMaintenanceTest.InverseOffers (2027 ms), filed MESOS-4169 [20].
>   * OversubscriptionTest.UpdateAllocatorOnSchedulerFailover (1018 ms),
> filed MESOS-4170 [21].
>   * OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover (1018 ms),
> filed MESOS-4171 [22].
>   * GarbageCollectorIntegrationTest.Restart (5102 ms), filed MESOS-4172
> [23].
>   * HealthCheckTest.CheckCommandTimeout (15483 ms), filed MESOS-4173 [24].
>   * HookTest.VerifySlaveLaunchExecutorHook (5061 ms), filed MESOS-4174
> [25].
>   * ContentType/SchedulerTest.Decline/0 (1022 ms), filed MESOS-4175 [26].
>
> Thanks for reading up to this point,
> AlexR
>
>
> [1] https://issues.apache.org/jira/browse/MESOS-733
> [2] https://issues.apache.org/jira/browse/MESOS-1757
> [3] https://issues.apache.org/jira/browse/MESOS-297
> [4] https://issues.apache.org/jira/browse/MESOS-2059
> [5] https://issues.apache.org/jira/browse/MESOS-3775
> [6] https://issues.apache.org/jira/browse/MESOS-4155
> [7] https://issues.apache.org/jira/browse/MESOS-4156
> [8] https://issues.apache.org/jira/browse/MESOS-4157
> [9] https://issues.apache.org/jira/browse/MESOS-4158
> [10] https://issues.apache.org/jira/browse/MESOS-4159
> [11] https://issues.apache.org/jira/browse/MESOS-4160
> [12] https://issues.apache.org/jira/browse/MESOS-4161
> [13] https://issues.apache.org/jira/browse/MESOS-4162
> [14] https://issues.apache.org/jira/browse/MESOS-4163
> [15] https://issues.apache.org/jira/browse/MESOS-4164
> [16] https://issues.apache.org/jira/browse/MESOS-4165
> [17] https://issues.apache.org/jira/browse/MESOS-4166
> [18] https://issues.apache.org/jira/browse/MESOS-4167
> [19] https://issues.apache.org/jira/browse/MESOS-4168
> [20] https://issues.apache.org/jira/browse/MESOS-4169
> [21] https://issues.apache.org/jira/browse/MESOS-4170
> [22] https://issues.apache.org/jira/browse/MESOS-4171
> [23] https://issues.apache.org/jira/browse/MESOS-4172
> [24] https://issues.apache.org/jira/browse/MESOS-4173
> [25] https://issues.apache.org/jira/browse/MESOS-4174
> [26] https://issues.apache.org/jira/browse/MESOS-4175
>



-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com
