geode-dev mailing list archives

From Jacob Barrett <jbarr...@pivotal.io>
Subject Re: [DISCUSS] When is a test not flaky anymore?
Date Mon, 09 Jul 2018 14:19:59 GMT
+1 and the same should go for @Ignore annotations as well.

> On Jul 6, 2018, at 11:10 AM, Alexander Murmann <amurmann@pivotal.io> wrote:
> 
> +1 for fixing immediately.
> 
> Since Dan is already trying to shake out more brittleness this seems to be
> the right time to get rid of the flaky label. Let's just treat all tests the
> same and fix them.
> 
>> On Fri, Jul 6, 2018 at 9:31 AM, Kirk Lund <klund@apache.org> wrote:
>> 
>> I should add that I'm only in favor of deleting the category if we have a
>> new policy of any failure means we have to fix the test and/or product
>> code. Even if you think that failure is in a test that you or your team is
>> not responsible for. That's no excuse to ignore a failure in your private
>> precheckin.
>> 
>>> On Fri, Jul 6, 2018 at 9:29 AM, Dale Emery <demery@pivotal.io> wrote:
>>> 
>>> The pattern I’ve seen in lots of other organizations: When a few tests
>>> intermittently give different answers, people attribute the intermittence
>>> to the tests, quickly lose trust in the entire suite, and increasingly
>>> discount failures.
>>> 
>>> If we’re going to attend to every failure in the larger suite, then we
>>> won’t suffer that fate, and I’m in favor of deleting the Flaky tag.
>>> 
>>> Dale
>>> 
>>>> On Jul 5, 2018, at 8:15 PM, Dan Smith <dsmith@pivotal.io> wrote:
>>>> 
>>>> Honestly I've never liked the flaky category. What it means is that at
>>>> some point in the past, we decided to put off tracking down and fixing a
>>>> failure and now we're left with a bug number and a description and
>>>> that's it.
>>>> 
>>>> I think we will be better off if we just get rid of the flaky category
>>>> entirely. That way no one can label anything else as flaky and push it
>>>> off for later, and if flaky tests fail again we will actually prioritize
>>>> and fix them instead of ignoring them.
>>>> 
>>>> I think Patrick was looking at rerunning the flaky tests to see what is
>>>> still failing. How about we just run the whole flaky suite some number
>>>> of times (100?), fix whatever is still failing and close out and remove
>>>> the category from the rest?
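The "run it N times and count failures" idea above can be sketched as a tiny harness. This is only an illustrative sketch, not Geode's actual tooling: `RerunCounter` and the stand-in `Runnable` are hypothetical, and a real run would invoke the actual JUnit/DUnit test instead.

```java
// Hypothetical harness: run a test body N times and count failures.
// In a real run, testBody would invoke the actual JUnit/DUnit test;
// here a stand-in Runnable plays that role.
public class RerunCounter {
    public static int countFailures(Runnable testBody, int runs) {
        int failures = 0;
        for (int i = 0; i < runs; i++) {
            try {
                testBody.run();
            } catch (AssertionError | RuntimeException e) {
                failures++; // record the failure and keep going
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in "flaky" body: fails on every third invocation.
        Runnable flaky = () -> {
            if (++calls[0] % 3 == 0) {
                throw new AssertionError("intermittent failure");
            }
        };
        System.out.println(countFailures(flaky, 100) + " failures out of 100 runs");
        // prints: 33 failures out of 100 runs
    }
}
```

How many runs is enough is exactly the judgment call this thread is about: 100 clean runs bounds the failure rate only loosely, since a 1-in-500 flake would still pass all 100 runs about 82% of the time ((499/500)^100 ≈ 0.82).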
>>>> 
>>>> I think we will get more benefit from shaking out and fixing the issues
>>>> we have in the current codebase than we will from carefully explaining
>>>> the flaky failures from the past.
>>>> 
>>>> -Dan
>>>> 
>>>>> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <demery@pivotal.io> wrote:
>>>>> 
>>>>> Hi Alexander and all,
>>>>> 
>>>>>> On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurmann@pivotal.io>
>>>>>> wrote:
>>>>>> 
>>>>>> Hi everyone!
>>>>>> 
>>>>>> Dan Smith started a discussion about shaking out more flaky DUnit
>>>>>> tests.
>>>>>> That's a great effort and I am happy it's happening.
>>>>>> 
>>>>>> As a corollary to that conversation I wonder what the criteria should
>>>>>> be for a test to not be considered flaky any longer and have the
>>>>>> category removed.
>>>>>> 
>>>>>> In general the bar should be fairly high. Even if a test only fails
>>>>>> ~1 in 500 runs that's still a problem given how many tests we have.
>>>>>> 
>>>>>> I see two ends of the spectrum:
>>>>>> 1. We have a good understanding why the test was flaky and think we
>>>>>> fixed it.
>>>>>> 2. We have a hard time reproducing the flaky behavior and have no
>>>>>> good theory as to why the test might have shown flaky behavior.
>>>>>> 
>>>>>> In the first case I'd suggest running the test ~100 times to get a
>>>>>> little more confidence that we fixed the flaky behavior and then
>>>>>> remove the category.
>>>>> 
>>>>> Here’s a test for case 1:
>>>>> 
>>>>> If we really understand why it was flaky, we will be able to:
>>>>>   - Identify the “faults”—the broken places in the code (whether
>>>>> system code or test code).
>>>>>   - Identify the exact conditions under which those faults led to the
>>>>> failures we observed.
>>>>>   - Explain how those faults, under those conditions, led to those
>>>>> failures.
>>>>>   - Run unit tests that exercise the code under those same conditions,
>>>>> and demonstrate that
>>>>>     the formerly broken code now does the right thing.
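The last criterion above, unit tests that exercise the code under the same conditions, can be illustrated with a hypothetical example. The imagined fault here is a non-atomic increment, the condition is concurrent callers, and the fix uses `AtomicInteger`; the `FixedCounter` class is invented for illustration and is not Geode code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of "exercise the code under the same conditions":
// the imagined fault was a plain int++ (not atomic); the fix below uses
// AtomicInteger, and the check recreates the failure-inducing condition
// (many threads incrementing concurrently).
public class FixedCounter {
    private final AtomicInteger count = new AtomicInteger();

    public void increment() {
        count.incrementAndGet(); // formerly a non-atomic count++ (the fault)
    }

    public int get() {
        return count.get();
    }

    public static void main(String[] args) throws InterruptedException {
        FixedCounter counter = new FixedCounter();
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) {
                    counter.increment();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        // With the old non-atomic increment this would intermittently print
        // less than 80000; with the fix it always prints 80000.
        System.out.println(counter.get());
    }
}
```

A check like this is deterministic in what it asserts, even though the interleaving varies from run to run, which is what lets it demonstrate the fix rather than merely fail to reproduce the bug.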
>>>>> 
>>>>> If we’re lacking any of these things, I’d say we’re dealing with
>>>>> case 2.
>>>>> 
>>>>>> The second case is a lot more problematic. How often do we want to
>>>>>> run a test like that before we decide that it might have been fixed
>>>>>> since we last saw it happen? Anything else we could/should do to
>>>>>> verify the test deserves our trust again?
>>>>> 
>>>>> 
>>>>> I would want a clear, compelling explanation of the failures we
>>>>> observed.
>>>>> 
>>>>> Clear and compelling are subjective, of course. For me, clear and
>>>>> compelling would include
>>>>> descriptions of:
>>>>>  - The faults in the code. What, specifically, was broken.
>>>>>  - The specific conditions under which the code did the wrong thing.
>>>>>  - How those faults, under those conditions, led to those failures.
>>>>>  - How the fix either prevents those conditions, or causes the
>>>>>    formerly broken code to now do the right thing.
>>>>> 
>>>>> Even if we don’t have all of these elements, we may have some of them.
>>>>> That can help us calibrate our confidence. But the elements work
>>>>> together. If we’re lacking one, the others are shaky, to some extent.
>>>>> 
>>>>> The more elements are missing in our explanation, the more times I’d
>>>>> want to run the test before trusting it.
>>>>> 
>>>>> Cheers,
>>>>> Dale
>>>>> 
>>>>> —
>>>>> Dale Emery
>>>>> demery@pivotal.io
>>>>> 
>>>>> 
>>> 
>>> —
>>> Dale Emery
>>> demery@pivotal.io
>>> 
>> 
