geode-dev mailing list archives

From Jinmei Liao <jil...@pivotal.io>
Subject Re: [DISCUSS] When is a test not flaky anymore?
Date Fri, 06 Jul 2018 13:56:12 GMT
+1 for removing the flaky category and fixing failures as they occur.

On Thu, Jul 5, 2018 at 8:21 PM Dan Smith <dsmith@pivotal.io> wrote:

> Honestly I've never liked the flaky category. What it means is that at some
> point in the past, we decided to put off tracking down and fixing a failure
> and now we're left with a bug number and a description and that's it.
>
> I think we will be better off if we just get rid of the flaky category
> entirely. That way no one can label anything else as flaky and push it off
> for later, and if flaky tests fail again we will actually prioritize and
> fix them instead of ignoring them.
>
> I think Patrick was looking at rerunning the flaky tests to see what is
> still failing. How about we just run the whole flaky suite some number of
> times (100?), fix whatever is still failing and close out and remove the
> category from the rest?
>
> I think we will get more benefit from shaking out and fixing the issues we
> have in the current codebase than we will from carefully explaining the
> flaky failures from the past.
>
> -Dan
>
> On Thu, Jul 5, 2018 at 7:03 PM, Dale Emery <demery@pivotal.io> wrote:
>
> > Hi Alexander and all,
> >
> > > On Jul 5, 2018, at 5:11 PM, Alexander Murmann <amurmann@pivotal.io> wrote:
> > >
> > > Hi everyone!
> > >
> > > Dan Smith started a discussion about shaking out more flaky DUnit
> > > tests. That's a great effort and I am happy it's happening.
> > >
> > > As a corollary to that conversation I wonder what the criteria should
> > > be for a test to not be considered flaky any longer and have the
> > > category removed.
> > >
> > > In general the bar should be fairly high. Even if a test only fails ~1
> > > in 500 runs that's still a problem given how many tests we have.
> > >
> > > I see two ends of the spectrum:
> > > 1. We have a good understanding why the test was flaky and think we
> > > fixed it.
> > > 2. We have a hard time reproducing the flaky behavior and have no good
> > > theory as to why the test might have shown flaky behavior.
> > >
> > > In the first case I'd suggest running the test ~100 times to get a
> > > little more confidence that we fixed the flaky behavior and then
> > > remove the category.
> >
> > Here’s a test for case 1:
> >
> > If we really understand why it was flaky, we will be able to:
> >     - Identify the “faults”: the broken places in the code (whether
> >       system code or test code).
> >     - Identify the exact conditions under which those faults led to the
> >       failures we observed.
> >     - Explain how those faults, under those conditions, led to those
> >       failures.
> >     - Run unit tests that exercise the code under those same conditions,
> >       and demonstrate that the formerly broken code now does the right
> >       thing.
> >
> > If we’re lacking any of these things, I’d say we’re dealing with case 2.
> >
> > > The second case is a lot more problematic. How often do we want to run
> > > a test like that before we decide that it might have been fixed since
> > > we last saw it happen? Anything else we could/should do to verify the
> > > test deserves our trust again?
> >
> >
> > I would want a clear, compelling explanation of the failures we observed.
> >
> > Clear and compelling are subjective, of course. For me, clear and
> > compelling would include descriptions of:
> >    - The faults in the code. What, specifically, was broken.
> >    - The specific conditions under which the code did the wrong thing.
> >    - How those faults, under those conditions, led to those failures.
> >    - How the fix either prevents those conditions, or causes the formerly
> >      broken code to now do the right thing.
> >
> > Even if we don’t have all of these elements, we may have some of them.
> > That can help us calibrate our confidence. But the elements work
> > together. If we’re lacking one, the others are shaky, to some extent.
> >
> > The more elements are missing in our explanation, the more times I’d want
> > to run the test before trusting it.
> >
> > Cheers,
> > Dale
> >
> > —
> > Dale Emery
> > demery@pivotal.io
> >
> >
>
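
A rough sketch of the arithmetic behind two figures quoted above (a test that
fails ~1 in 500 runs, and ~100 verification runs). It assumes every run fails
independently with the same fixed probability, which real flaky tests may not
do. The function names and the suite size of 1000 tests are hypothetical,
chosen only for illustration.

```python
import math


def prob_any_failure(failure_rate: float, runs: int) -> float:
    """Chance of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


def runs_needed(failure_rate: float, confidence: float) -> int:
    """Smallest number of runs that exposes the flake with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


rate = 1 / 500  # "fails ~1 in 500 runs", taken from the thread

# Chance that 100 runs of one such test show the failure at all (about 0.18).
print(round(prob_any_failure(rate, 100), 3))

# Runs needed to see the failure at least once with 95% probability.
print(runs_needed(rate, 0.95))

# Hypothetical suite of 1000 tests that each fail 1 in 500 runs:
# chance that a single full-suite run has at least one failure (about 0.86).
print(round(prob_any_failure(rate, 1000), 3))
```

Under these assumptions, 100 clean runs say little about a 1-in-500 flake,
which is in line with wanting more runs when the explanation of the failure
is incomplete.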


-- 
Cheers

Jinmei
