hbase-dev mailing list archives

From Matteo Bertozzi <theo.berto...@gmail.com>
Subject Re: Smart Flaky Handler
Date Fri, 20 May 2016 20:17:48 GMT
Any suggestions on how to make people aware of which tests are flaky?

For example, I would never have noticed that the procedure test was flaky if
Stack had not posted the list here.
So maybe a weekly digest on the dev list with the list of flaky tests would
reach a wider audience than expecting people to go look at the job.

Also, I was thinking about how I would notice that I broke something when I
post a patch.
Since we exclude the flaky tests from the run, there is no way I can notice
from QA that I broke one of them.
Maybe we can add a section to QA that runs the flaky ones and tells you
"these failed but may be flaky",
so that at least you can check whether the failures are related to the patch
or are just flaky.
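
A rough sketch of what such a QA section could look like, assuming the QA run already has the list of failed test names and the current flaky list. The function names and the `TestFoo` example are hypothetical and are not part of the existing precommit tooling:

```python
def triage_failures(failed_tests, known_flaky):
    """Split failed test names into (possibly_flaky, needs_attention).

    failed_tests: names of tests that failed in this QA run.
    known_flaky:  names currently on the flaky list (assumed input).
    """
    known = set(known_flaky)
    possibly_flaky = sorted(t for t in failed_tests if t in known)
    needs_attention = sorted(t for t in failed_tests if t not in known)
    return possibly_flaky, needs_attention


def format_qa_section(failed_tests, known_flaky):
    """Render a plain-text QA section that separates the two groups."""
    possibly_flaky, needs_attention = triage_failures(failed_tests, known_flaky)
    lines = []
    if needs_attention:
        lines.append("Failed tests (not on the flaky list):")
        lines.extend("  " + t for t in needs_attention)
    if possibly_flaky:
        lines.append("Failed, but may be flaky:")
        lines.extend("  " + t for t in possibly_flaky)
    if not lines:
        lines.append("No test failures.")
    return "\n".join(lines)


if __name__ == "__main__":
    flaky = ["TestWALProcedureStore", "TestChoreService"]
    failed = ["TestWALProcedureStore", "TestFoo"]  # TestFoo is made up
    print(format_qa_section(failed, flaky))
```

The patch author would still have to read the "may be flaky" group by hand; the sketch only sorts the failures, it does not decide whether the patch caused them.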

On Fri, May 20, 2016 at 11:03 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:

> Nice work Appy! What do I need to do to get it wired up for branch-1.1?
>
> On Fri, May 20, 2016 at 9:25 AM, Stack <stack@duboce.net> wrote:
>
> > The system seems to be working nicely Appy. We are getting green precommit
> > builds for the first time in ages.
> >
> > Should we change the includes and excludes lists so they have a file type
> > ending? .txt? Then I could open them easily in the browser. Currently I
> > have to download them.
> >
> > Includes are tests that are currently considered 'flakey'?
> >
> >
> >
> > TestGenerateDelegationToken,TestMobCompactor,TestRegionServerMetrics,TestAcidGuarantees,TestMasterReplication,TestRowProcessorEndpoint,TestAsyncLogRolling,DynamicLogicExpressionSuite,TestMasterFailoverWithProcedures,TestChoreService,TestScannerHeartbeatMessages,TestWALProcedureStore,TestRegionMergeTransactionOnCluster,TestSaslFanOutOneBlockAsyncDFSOutput,TestReplicationEndpointWithMultipleWAL
> >
> > We have a nice list.
> >
> > Excludes are:
> >
> >
> >
> > **/TestGenerateDelegationToken.java,**/TestMobCompactor.java,**/TestRegionServerMetrics.java,**/TestAcidGuarantees.java,**/TestMasterReplication.java,**/TestRowProcessorEndpoint.java,**/TestAsyncLogRolling.java,**/DynamicLogicExpressionSuite.java,**/TestMasterFailoverWithProcedures.java,**/TestChoreService.java,**/TestScannerHeartbeatMessages.java,**/TestWALProcedureStore.java,**/TestRegionMergeTransactionOnCluster.java,**/TestSaslFanOutOneBlockAsyncDFSOutput.java,**/TestReplicationEndpointWithMultipleWAL.java,
> >
> > What's the '**/' about? Is it supposed to have opening/closing versions?
> >
> > Thanks boss,
> > St.
> >
> >
> >
> > On Mon, May 16, 2016 at 4:45 PM, Stack <stack@duboce.net> wrote:
> >
> > > Sweet!
> > >
> > > On Mon, May 16, 2016 at 4:38 PM, Apekshit Sharma <appy@cloudera.com>
> > > wrote:
> > >
> > >> This mail is to introduce the work to tackle the flaky tests in our
> > >> build.
> > >>
> > >> *Why is it important?*
> > >> - Our build history sucks: the last 175 post-commit runs failed. We
> > >> need to make it useful.
> > >> - To better understand our code’s testing status and, more importantly,
> > >> its weak points.
> > >> - We know those 2-3 tests which keep failing every now and then, but
> > >> not those ~10 nasty ones which fail like 1 out of 50 times, and screw
> > >> our build.
> > >> - This isn’t something that can be done manually on a daily basis. We
> > >> need automation.
> > >>
> > >> *Changes made so far:*
> > >> Code changes: HBASE-15839
> > >> <https://issues.apache.org/jira/browse/HBASE-15839>  (Umbrella issue)
> > >>
> > >> *Jenkins changes:*
> > >>
> > >>
> > >> [Diagram link: https://issues.apache.org/jira/secure/attachment/12804292/Screen%20Shot%202016-05-16%20at%204.02.46%20PM.png]
> > >>
> > >> *(new job) HBase-Find-Flaky-Tests*: Gets test reports of recent builds
> > >> of the post-commit job (TRUNK_matrix) and the HBase-Flaky-Tests job
> > >> (see below) to find flaky tests. The frequency of runs determines how
> > >> fast we catch test regressions. So if we run it every 4 hours, any test
> > >> which started failing in the post-commit job (TRUNK_matrix) in the last
> > >> 4 hours will be blacklisted.
> > >>
> > >> *(new job) HBase-Flaky-Tests*: This job runs only the flaky tests. The
> > >> aim is to run this job back-to-back to collect as many runs as we can.
> > >> The higher the run rate, the better our system will be at catching the
> > >> flaky tests. We currently run it hourly, so we’ll be able to keep track
> > >> of flaky tests with ~5% failure rate or more.
> > >>
> > >> *Post-commit (TRUNK_matrix) and pre-commit jobs*: Exclude these flaky
> > >> tests.
> > >>
> > >>
> > >> *So what if a bad commit makes a good test bad?*
> > >> Since the test is not on the flaky list yet, it’ll run in the next
> > >> post-commit and will fail.
> > >> The next run of HBase-Find-Flaky-Tests will pick it up and blacklist it.
> > >> Blacklisting will help keep the post-commit job and, more importantly,
> > >> the pre-commit job clean, a problem we face quite often.
> > >>
> > >> *Are we just tucking away our shit?*
> > >> Nope, this will help us:
> > >> - first, maintain a list of bad tests (we lack that today).
> > >> - second, make our build greener to the point that a failed/red build
> > >> is something we worry about seriously.
> > >>
> > >> Once we are confident that the system is working fine, we’ll set up the
> > >> HBase-Find-Flaky-Tests job to send reports to dev@hbase so that devs
> > >> know about the bad tests. If it remains hidden somewhere in a Jenkins
> > >> job’s archive, it’s unlikely that we’ll actively work on getting them
> > >> fixed :).
> > >>
> > >> I'll keep posting further updates on this thread.
> > >>
> > >> -- Appy
> > >>
> > >
> > >
> >
>
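
For readers following the quoted announcement, here is a minimal sketch of how a job like HBase-Find-Flaky-Tests could turn recent test reports into the includes and excludes lists shown above. It is not the actual job script: the input shape (one dict per build, mapping test name to pass/fail) and every function name are assumptions made for illustration. The excludes are assumed to be Ant-style path patterns of the kind Maven Surefire accepts, where a leading '**/' matches the test file at any directory depth.

```python
from collections import Counter


def failure_counts(runs):
    """Count failures and executions per test.

    runs: list of dicts, one per build, mapping test name -> True if it
    passed, False if it failed (assumed input shape).
    """
    failed = Counter()
    total = Counter()
    for run in runs:
        for name, passed in run.items():
            total[name] += 1
            if not passed:
                failed[name] += 1
    return failed, total


def find_flaky(runs):
    """Blacklist every test that failed in at least one recent build."""
    failed, _ = failure_counts(runs)
    return sorted(failed)


def failure_rates(runs):
    """Observed failure rate per blacklisted test, e.g. for a report."""
    failed, total = failure_counts(runs)
    return {name: failed[name] / total[name] for name in failed}


def to_includes(names):
    """Comma-separated test names, for the job that runs only flaky tests."""
    return ",".join(names)


def to_excludes(names):
    """Comma-separated '**/Name.java' patterns, for pre/post-commit jobs."""
    return ",".join("**/{}.java".format(n) for n in names)


if __name__ == "__main__":
    # TestA, TestB and TestC are made-up names.
    runs = [
        {"TestA": True, "TestB": False},
        {"TestA": True, "TestB": True},
        {"TestA": True, "TestC": False},
    ]
    flaky = find_flaky(runs)
    print(to_includes(flaky))
    print(to_excludes(flaky))
```

Blacklisting on a single failure follows the announcement's description ("any test which started failing ... will be blacklisted"); a real job might instead apply a threshold to the observed failure rate.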
