From: Stack
Date: Wed, 11 Oct 2017 21:21:45 -0700
Subject: Re: [DISCUSS] options for precommit test reliability?
To: HBase Dev List <dev@hbase.apache.org>

On Wed, Oct 11, 2017 at 10:19 AM, Stack wrote:

> That's a lovely report Busbey.
>
> Let me see if I can get a rough answer to your question on minicluster
> cores.
>

On a clean machine w/ 48 cores, we spend an hour or so on 'smalltests' (no
fork). We're using less than 10% of the CPUs (vmstat says ~95% idle). No
I/O. When we get to the second part of the test run (medium+large), CPU use
goes up (fork = 5) and we move up to maybe 15% of CPU (vmstat is >85% idle).
I can't push beyond that because tests are failing and timing out, even on a
'clean' machine (let me try w/ the flakies list in place).

If I up the forking -- 1/4 of the CPUs for small tests and 1/2 for
medium/large -- we seem to spin through the smalls fast (15 mins or less --
all pass). The mediums seem to fluctuate between 15-60% of CPU. Overall, I
ran more tests in a quarter of the time w/ the upped forking (30-odd mins
vs two hours).

It would seem that our defaults are anemic (currently we use ~3-4 cores for
the small test run and 8-10 cores for medium/large). We could have fun
setting the fork count based off the hardware; it could bring down our
elapsed time for test runs. In the past, surefire used to lose a few tests
when concurrency was high. It might be better now.

St.Ack
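A minimal sketch of the "set fork count based off the hardware" idea above:
surefire's forkCount parameter natively accepts a C multiplier (for example
forkCount=0.25C for small tests and 0.5C for medium/large), and the same
calculation can be done by querying the OS directly. The
-Dsurefire.firstPartForkCount / -Dsurefire.secondPartForkCount property
names below are assumed placeholders for however the HBase pom wires fork
counts, not confirmed settings.

#!/usr/bin/env python
# Illustrative only: derive fork counts from the machine's core count
# (1/4 of cores for small tests, 1/2 for medium/large) and print the
# resulting Maven invocation. The -Dsurefire.*ForkCount names are assumed.
import multiprocessing

cores = multiprocessing.cpu_count()
small_forks = max(1, cores // 4)   # small tests: ~1/4 of the CPUs
large_forks = max(1, cores // 2)   # medium/large tests: ~1/2 of the CPUs

print("mvn test"
      " -Dsurefire.firstPartForkCount=%d"
      " -Dsurefire.secondPartForkCount=%d" % (small_forks, large_forks))

Using the C multiplier directly in the pom would avoid the wrapper script
entirely and adapt to whatever box the build lands on.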
> S
>
> On Wed, Oct 11, 2017 at 6:43 AM, Sean Busbey wrote:
>
>> Currently our precommit build has a history of ~233 builds.
>>
>> Looking across[1] those for the ones with unit test logs, and treating
>> the string "timeout" as an indicator that things failed because of a
>> timeout rather than a known bad answer, we have 80 builds that had one
>> or more tests time out.
>>
>> Breaking this down by host:
>>
>> | Host | % timeout | Success | Timeout Failure | General Failure |
>> | ---- | ---------:| -------:| ---------------:| ---------------:|
>> | H0   |       42% |      10 |              15 |              11 |
>> | H1   |       54% |       6 |              14 |               6 |
>> | H2   |       45% |      18 |              35 |              24 |
>> | H3   |      100% |       0 |               1 |               0 |
>> | H4   |        0% |       1 |               0 |               2 |
>> | H5   |       20% |       1 |               1 |               3 |
>> | H6   |       44% |       4 |               4 |               1 |
>> | H9   |       35% |       2 |               7 |              11 |
>> | H10  |       26% |       4 |               8 |              19 |
>> | H11  |        0% |       0 |               0 |               2 |
>> | H12  |       43% |       1 |               3 |               3 |
>> | H13  |       22% |       1 |               2 |               6 |
>> | H26  |        0% |       0 |               0 |               1 |
>>
>> It's odd that we so strongly favor H2. But I don't see evidence that
>> we have a bad host that we could just exclude.
>>
>> Scaling our concurrency by the number of CPU cores is something
>> surefire can do. Let me see what the H* hosts look like to figure out
>> some example mappings. Do we have a rough bound on how many cores a
>> single test using MiniCluster should need? 3?
>>
>> -busbey
>>
>> [1]: By "looking across" I mean using the python-jenkins library:
>> https://gist.github.com/busbey/ff5f7ae3a292164cc110fdb934935c8c
>>
>> On Mon, Oct 9, 2017 at 4:40 PM, Stack wrote:
>> > On Mon, Oct 9, 2017 at 7:38 AM, Sean Busbey wrote:
>> >
>> >> Hi folks!
>> >>
>> >> Lately our precommit runs have had a large amount of noise around
>> >> unit test failures due to timeout, especially for the hbase-server
>> >> module.
>> >
>> > I've not looked at why the timeouts. Anyone? Usually there is a cause.
>> >
>> > ...
>> >
>> >> I'd really like to get us back to a place where a precommit -1
>> >> doesn't just result in a reflexive "precommit is unreliable."
>> >
>> > This is the default. The exception is when one of us works on
>> > stabilizing the test suite. It takes a while and a bunch of effort,
>> > but stabilization has been doable in the past. Once stable, it stays
>> > that way a while before the rot sets in.
>> >
>> >> * Do fewer parallel executions. We do 5 tests at once now and the
>> >> hbase-server module takes ~1.5 hours. We could tune down just the
>> >> hbase-server module to do fewer.
>> >
>> > Is it the loading that is the issue, or tests stamping on each other?
>> > If the latter, I'd think we'd want to fix it. If the former, we'd
>> > want to look at it too; I'd think our tests shouldn't be such that
>> > they fall over if the context is other than 'perfect'.
>> >
>> > I've not looked at a machine while five concurrent hbase tests are
>> > running. Is it even putting up a load? Over the extent of the full
>> > test suite? Or is it just a few tests that cause issues when run
>> > together? Could we stagger these, give them their own category, or
>> > have them burn less brightly?
>> >
>> > If tests are failing because of contention for resources, we should
>> > fix the tests. If given a machine, we should burn it up rather than
>> > pussy-foot it, I'd say (can we size the concurrency off a query of
>> > the underlying OS so we step by CPUs, say?).
>> >
>> > The tests could also do with an edit. Generally, tests are written
>> > once and then never touched again; meantime the system evolves. An
>> > edit could look for redundancy, and for cases where we start clusters
>> > -- time-consuming -- when we don't have to (use mocks or start
>> > standalone instances instead). We also have some crazy tests that
>> > spin up lots of clusters all inside a single JVM even though the
>> > context is the same as that of a simple method evaluation.
>> >
>> > St.Ack
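The gist referenced in [1] is Busbey's actual script. The fragment below is
only a minimal sketch of the same idea using the python-jenkins library:
walk recent precommit builds, grep each console log for "timeout", and tally
results per build host. The Jenkins URL and the PreCommit-HBASE-Build job
name are assumptions, not taken from the thread.

#!/usr/bin/env python
# Sketch of a per-host timeout tally with python-jenkins.
# Server URL and job name are assumed, not verified.
from collections import Counter, defaultdict
import jenkins

server = jenkins.Jenkins('https://builds.apache.org')
job = 'PreCommit-HBASE-Build'

by_host = defaultdict(Counter)
for build in server.get_job_info(job)['builds']:
    number = build['number']
    info = server.get_build_info(job, number)
    host = info.get('builtOn') or 'unknown'
    console = server.get_build_console_output(job, number)
    if info.get('result') == 'SUCCESS':
        by_host[host]['success'] += 1
    elif 'timeout' in console.lower():
        by_host[host]['timeout failure'] += 1
    else:
        by_host[host]['general failure'] += 1

for host, counts in sorted(by_host.items()):
    print("%s: %s" % (host, dict(counts)))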