From: Stack
Date: Wed, 11 Oct 2017 21:21:45 -0700
Subject: Re: [DISCUSS] options for precommit test reliability?
To: HBase Dev List <dev@hbase.apache.org>

On Wed, Oct 11, 2017 at 10:19 AM, Stack wrote:

> That's a lovely report Busbey.
>
> Let me see if I can get a rough answer to your question on minicluster
> cores.
>

On a clean machine w/ 48 cores, we spend an hour or so on 'smalltests' (no
fork). We're using less than 10% of the CPUs (vmstat says ~95% idle). No
I/O. When we get to the second part of the test run (medium+large), CPU use
goes up (fork = 5) and we move up to maybe 15% of CPU (vmstat is >85% idle).
I can't push beyond that because tests are failing and timing out, even on a
'clean' machine (let me try w/ the flakies list in place).

If I up the forking -- 1/4 of the CPUs for small tests and 1/2 for
medium/large -- we seem to spin through the smalls fast (15 mins or less --
all pass). The mediums seem to fluctuate between 15-60% of CPU. Overall, I
ran more tests in a quarter of the time w/ the upped forking (30-odd mins
vs two hours).

It would seem that our defaults are anemic (currently we use ~3-4 cores for
the small test run and 8-10 cores for medium/large). We could have fun
setting the fork count based off the hardware; it could bring down our
elapsed time for test runs. In the past, surefire used to lose a few tests
when concurrency was high. It might be better now.

St.Ack
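A minimal sketch of the "set fork count based off the hardware" idea above:
surefire's forkCount parameter natively accepts a C multiplier (for example
forkCount=0.25C for small tests and 0.5C for medium/large), and the same
calculation can be done by querying the OS directly. The
-Dsurefire.firstPartForkCount / -Dsurefire.secondPartForkCount property
names below are assumed placeholders for however the HBase pom wires fork
counts, not confirmed settings.

#!/usr/bin/env python
# Illustrative only: derive fork counts from the machine's core count
# (1/4 of cores for small tests, 1/2 for medium/large) and print the
# resulting Maven invocation. The -Dsurefire.*ForkCount names are assumed.
import multiprocessing

cores = multiprocessing.cpu_count()
small_forks = max(1, cores // 4)   # small tests: ~1/4 of the CPUs
large_forks = max(1, cores // 2)   # medium/large tests: ~1/2 of the CPUs

print("mvn test"
      " -Dsurefire.firstPartForkCount=%d"
      " -Dsurefire.secondPartForkCount=%d" % (small_forks, large_forks))

Using the C multiplier directly in the pom would avoid the wrapper script
entirely and adapt to whatever box the build lands on.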
> S
>
> On Wed, Oct 11, 2017 at 6:43 AM, Sean Busbey wrote:
>
>> Currently our precommit build has a history of ~233 builds.
>>
>> Looking across[1] those for the ones with unit test logs, and treating
>> the string "timeout" as an indicator that things failed because of a
>> timeout rather than a known bad answer, we have 80 builds that had one
>> or more tests time out.
>>
>> Breaking this down by host:
>>
>> | Host | % timeout | Success | Timeout Failure | General Failure |
>> | ---- | ---------:| -------:| ---------------:| ---------------:|
>> | H0   |       42% |      10 |              15 |              11 |
>> | H1   |       54% |       6 |              14 |               6 |
>> | H2   |       45% |      18 |              35 |              24 |
>> | H3   |      100% |       0 |               1 |               0 |
>> | H4   |        0% |       1 |               0 |               2 |
>> | H5   |       20% |       1 |               1 |               3 |
>> | H6   |       44% |       4 |               4 |               1 |
>> | H9   |       35% |       2 |               7 |              11 |
>> | H10  |       26% |       4 |               8 |              19 |
>> | H11  |        0% |       0 |               0 |               2 |
>> | H12  |       43% |       1 |               3 |               3 |
>> | H13  |       22% |       1 |               2 |               6 |
>> | H26  |        0% |       0 |               0 |               1 |
>>
>> It's odd that we so strongly favor H2. But I don't see evidence that
>> we have a bad host that we could just exclude.
>>
>> Scaling our concurrency by the number of CPU cores is something
>> surefire can do. Let me see what the H* hosts look like to figure out
>> some example mappings. Do we have a rough bound on how many cores a
>> single test using MiniCluster should need? 3?
>>
>> -busbey
>>
>> [1]: By "looking across" I mean using the python-jenkins library:
>> https://gist.github.com/busbey/ff5f7ae3a292164cc110fdb934935c8c
>>
>> On Mon, Oct 9, 2017 at 4:40 PM, Stack wrote:
>> > On Mon, Oct 9, 2017 at 7:38 AM, Sean Busbey wrote:
>> >
>> >> Hi folks!
>> >>
>> >> Lately our precommit runs have had a large amount of noise around
>> >> unit test failures due to timeout, especially for the hbase-server
>> >> module.
>> >
>> > I've not looked at why the timeouts. Anyone? Usually there is a cause.
>> >
>> > ...
>> >
>> >> I'd really like to get us back to a place where a precommit -1
>> >> doesn't just result in a reflexive "precommit is unreliable."
>> >
>> > This is the default. The exception is when one of us works on
>> > stabilizing the test suite. It takes a while and a bunch of effort,
>> > but stabilization has been doable in the past. Once stable, it stays
>> > that way a while before the rot sets in.
>> >
>> >> * Do fewer parallel executions. We do 5 tests at once now and the
>> >> hbase-server module takes ~1.5 hours. We could tune down just the
>> >> hbase-server module to do fewer.
>> >
>> > Is it the loading that is the issue, or tests stamping on each other?
>> > If the latter, I'd think we'd want to fix it. If the former, we'd
>> > want to look at it too; I'd think our tests shouldn't be such that
>> > they fall over if the context is other than 'perfect'.
>> >
>> > I've not looked at a machine while five concurrent hbase tests are
>> > running. Is it even putting up a load? Over the extent of the full
>> > test suite? Or is it just a few tests that cause issues when run
>> > together? Could we stagger these, give them their own category, or
>> > have them burn less brightly?
>> >
>> > If tests are failing because of contention for resources, we should
>> > fix the tests. If given a machine, we should burn it up rather than
>> > pussy-foot it, I'd say (can we size the concurrency off a query of
>> > the underlying OS so we step by CPUs, say?).
>> >
>> > The tests could also do with an edit. Generally, tests are written
>> > once and then never touched again; meantime the system evolves. An
>> > edit could look for redundancy, and for cases where we start clusters
>> > -- time-consuming -- when we don't have to (use mocks or start
>> > standalone instances instead). We also have some crazy tests that
>> > spin up lots of clusters all inside a single JVM even though the
>> > context is the same as that of a simple method evaluation.
>> >
>> > St.Ack
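The gist referenced in [1] is Busbey's actual script. The fragment below is
only a minimal sketch of the same idea using the python-jenkins library:
walk recent precommit builds, grep each console log for "timeout", and tally
results per build host. The Jenkins URL and the PreCommit-HBASE-Build job
name are assumptions, not taken from the thread.

#!/usr/bin/env python
# Sketch of a per-host timeout tally with python-jenkins.
# Server URL and job name are assumed, not verified.
from collections import Counter, defaultdict
import jenkins

server = jenkins.Jenkins('https://builds.apache.org')
job = 'PreCommit-HBASE-Build'

by_host = defaultdict(Counter)
for build in server.get_job_info(job)['builds']:
    number = build['number']
    info = server.get_build_info(job, number)
    host = info.get('builtOn') or 'unknown'
    console = server.get_build_console_output(job, number)
    if info.get('result') == 'SUCCESS':
        by_host[host]['success'] += 1
    elif 'timeout' in console.lower():
        by_host[host]['timeout failure'] += 1
    else:
        by_host[host]['general failure'] += 1

for host, counts in sorted(by_host.items()):
    print("%s: %s" % (host, dict(counts)))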