hbase-dev mailing list archives

From Apekshit Sharma <a...@cloudera.com>
Subject Re: Smart Flaky Handler
Date Fri, 20 May 2016 22:29:56 GMT
>
> Should we change the includes and excludes lists so they have a file type ending?
> .txt? Then I could open them easily in the browser. Currently I have to
> download them.


Clicking on 'view' next to the filename in artifacts list will show the
contents directly.

> What's the '**/' about? Is it supposed to have opening/closing versions?


There are two files, includes and excludes, because the two Maven flags
(-Dtest and -Dtest.exclude.pattern) that select which tests to run and which
to skip require different formats. With surefire 2.19.1 this would not be
needed, since the two properties surefire.excludesFile and
surefire.includesFile can read the same file containing the list of tests.
We tried updating to 2.19.1, but started seeing lots of timeouts, so we
reverted. Two files isn't pretty, but it works with what we have.
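
As a rough illustration of the two formats, here is a minimal Python sketch
that derives both strings from one list of test class names. The function
name build_lists is made up for this example and is not taken from
report-flakies.py; the output shapes follow the includes and excludes lists
quoted further down in this thread.

```python
def build_lists(test_names):
    """Return (includes, excludes) strings for a list of test class names.

    includes: comma-separated class names.
    excludes: comma-separated '**/<ClassName>.java' path patterns.
    """
    # Drop duplicates while keeping the original order.
    names = list(dict.fromkeys(test_names))
    includes = ",".join(names)
    excludes = ",".join("**/%s.java" % name for name in names)
    return includes, excludes


includes, excludes = build_lists(["TestMobCompactor", "TestAcidGuarantees"])
print(includes)
print(excludes)
```

Keeping one source list and generating both files from it means the two
can't drift apart.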

> What do I need to do to get it wired up for branch-1.1?


I haven't thought it through fully, since I was more focused on making it
robust first (see the upcoming changes below). But it may be as simple as:
- setting up another job like HBase-Find-Flaky-Tests [1] which builds the
flaky list for 1.1
- then, based on the branch, sourcing the list of flaky tests from the
appropriate Find-Flaky-Tests job
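
A minimal sketch of that per-branch lookup, in Python. Everything named
here is an assumption for illustration: the branch-1.1 job does not exist
yet, so its name is hypothetical, as is the idea that the existing job
serves master and that the lists are published as build artifacts called
includes and excludes. Only the lastSuccessfulBuild/artifact/ URL layout is
standard Jenkins.

```python
# Hypothetical mapping from branch to the job that builds its flaky list.
FLAKY_JOB_BY_BRANCH = {
    "master": "HBASE-Find-Flaky-Tests",  # existing job [1]; branch is assumed
    "branch-1.1": "HBASE-Find-Flaky-Tests-branch-1.1",  # hypothetical name
}

JENKINS_BASE = "https://builds.apache.org/job"


def flaky_list_url(branch, filename="excludes"):
    """Return the artifact URL of a flaky list for `branch`, or None."""
    job = FLAKY_JOB_BY_BRANCH.get(branch)
    if job is None:
        # No flaky list is maintained for this branch: exclude nothing.
        return None
    return "%s/%s/lastSuccessfulBuild/artifact/%s" % (JENKINS_BASE, job, filename)


print(flaky_list_url("branch-1.1"))
```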

> any suggestion on how to make people aware of the tests being flaky?


Once it's complete, maybe set it up to send mail to dev@hbase?
Although it's good now, there are a few changes I have in mind that could
make it dramatically better.

Next changes:
1. Change the report-flakies.py script to also report timed-out tests.

2. Currently we use the last 25 builds of HBase-Flaky-Tests [2] to measure
the flakiness of tests. I want to increase that to something like 50 or 100.
The idea is that it's much harder to detect tests which fail 1% of the time
than those which fail 20% of the time. But 25 such tests (each failing 1% of
the time) will make our builds fail more than 20% of the time.
Collecting that many builds takes time, so one of the main goals is to bring
down the runtime of this job. I'll do that by assigning appropriate timeouts
to these bad tests individually, so that they fail fast.

3. Last, but most important: run all tests (good and flaky) every time and
use the "-fn" (fail never) option in Maven builds. Then, instead of deciding
success/failure based on Maven's status, analyze the test results ourselves
and decide pass/fail taking the flaky tests list into account. This might
get tricky, and we will have to see how things fit with Yetus.
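
The arithmetic behind item 2 can be checked with a short Python sketch. It
assumes test failures are independent, which is a simplification. Under
that assumption, 25 tests that each fail 1% of the time turn a build red
roughly 22% of the time, and a 25-build window has about the same chance of
showing a given 1% test failing even once.

```python
def p_at_least_one_failure(failure_rate, trials):
    """Chance of at least one failure in `trials` independent attempts."""
    return 1.0 - (1.0 - failure_rate) ** trials


# 25 independent tests, each failing 1% of the time: chance a build goes red.
print("build fails: %.1f%%" % (100 * p_at_least_one_failure(0.01, 25)))

# Chance that a window of N builds shows a given test failing at least once.
for window in (25, 50, 100):
    for rate in (0.01, 0.20):
        p = p_at_least_one_failure(rate, window)
        print("window=%3d rate=%.2f seen failing: %.1f%%" % (window, rate, 100 * p))
```

A larger window mostly helps with the rare failures; a test that fails 20%
of the time is almost certain to show up even in 25 builds.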

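For item 3, the pass/fail decision could look roughly like the Python
sketch below. It is not existing code: failed_classes and verdict are
hypothetical names, and the sketch assumes the test results are available
as JUnit-style XML reports of the kind surefire writes, where a failing
testcase element has a failure or error child.

```python
import xml.etree.ElementTree as ET


def failed_classes(report_xml):
    """Return simple class names with a failure or error in one XML report."""
    root = ET.fromstring(report_xml)
    failed = set()
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            # Keep only the class name, dropping the package prefix.
            failed.add(case.get("classname", "").rsplit(".", 1)[-1])
    return failed


def verdict(reports, flaky):
    """Return (passed, real_failures, ignored_flaky_failures)."""
    failed = set()
    for report_xml in reports:
        failed |= failed_classes(report_xml)
    flaky = set(flaky)
    real = sorted(failed - flaky)      # failures that should turn the build red
    ignored = sorted(failed & flaky)   # failures tolerated because known flaky
    return (not real, real, ignored)


# Made-up report used only to exercise the functions above.
SAMPLE_REPORT = """
<testsuite name="org.apache.hadoop.hbase.TestChoreService" tests="2" failures="1">
  <testcase classname="org.apache.hadoop.hbase.TestChoreService" name="testA"/>
  <testcase classname="org.apache.hadoop.hbase.TestChoreService" name="testB">
    <failure message="boom"/>
  </testcase>
</testsuite>
"""

print(verdict([SAMPLE_REPORT], flaky=["TestChoreService"]))
print(verdict([SAMPLE_REPORT], flaky=[]))
```

Reporting the ignored failures separately keeps the flaky tests visible in
the build output instead of hiding them.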

[1] https://builds.apache.org/job/HBASE-Find-Flaky-Tests/
[2] https://builds.apache.org/job/HBase-Flaky-Tests/


On Fri, May 20, 2016 at 1:17 PM, Matteo Bertozzi <theo.bertozzi@gmail.com>
wrote:

> any suggestion on how to make people aware of the tests being flaky?
>
> for example, I would never have noticed the procedure test being flaky if
> it were not for stack posting the list here.
> so, maybe a weekly digest on the dev list with the list of flaky tests will
> reach a bigger audience than having people go into the job.
>
> also, I was thinking about how I would notice if I broke something when I
> post a patch.
> since we exclude the flaky tests from the run, there is no way I can notice
> from QA that I broke something.
> maybe we can add a section in QA that runs the flaky ones and tells you
> "these failed but may be flaky",
> so at least one can look at whether the failures are related to the patch
> or are just flaky.
>
> On Fri, May 20, 2016 at 11:03 AM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
>
> > Nice work Appy! What do I need to do to get it wired up for branch-1.1?
> >
> > On Fri, May 20, 2016 at 9:25 AM, Stack <stack@duboce.net> wrote:
> >
> > > The system seems to be working nicely Appy. We are getting green
> > > precommit builds for the first time in ages.
> > >
> > > Should we change the includes and excludes lists so they have a file
> > > type ending? .txt? Then I could open them easily in the browser.
> > > Currently I have to download them.
> > >
> > > Includes are tests that are currently considered 'flakey'?
> > >
> > >
> > >
> > > TestGenerateDelegationToken,TestMobCompactor,TestRegionServerMetrics,TestAcidGuarantees,TestMasterReplication,TestRowProcessorEndpoint,TestAsyncLogRolling,DynamicLogicExpressionSuite,TestMasterFailoverWithProcedures,TestChoreService,TestScannerHeartbeatMessages,TestWALProcedureStore,TestRegionMergeTransactionOnCluster,TestSaslFanOutOneBlockAsyncDFSOutput,TestReplicationEndpointWithMultipleWAL
> > >
> > > We have a nice list.
> > >
> > > Excludes are:
> > >
> > >
> > >
> > > **/TestGenerateDelegationToken.java,**/TestMobCompactor.java,**/TestRegionServerMetrics.java,**/TestAcidGuarantees.java,**/TestMasterReplication.java,**/TestRowProcessorEndpoint.java,**/TestAsyncLogRolling.java,**/DynamicLogicExpressionSuite.java,**/TestMasterFailoverWithProcedures.java,**/TestChoreService.java,**/TestScannerHeartbeatMessages.java,**/TestWALProcedureStore.java,**/TestRegionMergeTransactionOnCluster.java,**/TestSaslFanOutOneBlockAsyncDFSOutput.java,**/TestReplicationEndpointWithMultipleWAL.java,
> > >
> > > Whats the '**/' about? Is it supposed to have opening/closing versions?
> > >
> > > Thanks boss,
> > > St.
> > >
> > >
> > >
> > > On Mon, May 16, 2016 at 4:45 PM, Stack <stack@duboce.net> wrote:
> > >
> > > > Sweet!
> > > >
> > > > On Mon, May 16, 2016 at 4:38 PM, Apekshit Sharma <appy@cloudera.com>
> > > > wrote:
> > > >
> > > > >> This mail is to introduce the work to tackle the flaky tests in our
> > > > >> build.
> > > > >>
> > > > >> *Why is it important?*
> > > > >> - Our build history sucks: the last 175 post-commit runs failed. We
> > > > >> need to make it useful.
> > > > >> - To better understand our code’s testing status and, more
> > > > >> importantly, its weak points.
> > > > >> - We know those 2-3 tests which keep failing every now and then, but
> > > > >> not those ~10 nasty ones which fail like 1 out of 50 times and screw
> > > > >> our build.
> > > > >> - This isn’t something that can be done manually on a daily basis. We
> > > > >> need automation.
> > > >>
> > > > >> *Changes made so far:*
> > > > >> Code changes: HBASE-15839
> > > > >> <https://issues.apache.org/jira/browse/HBASE-15839> (Umbrella issue)
> > > >>
> > > >> *Jenkins changes:*
> > > >>
> > > >>
> > > > >> [Diagram link:
> > > > >> https://issues.apache.org/jira/secure/attachment/12804292/Screen%20Shot%202016-05-16%20at%204.02.46%20PM.png
> > > > >> ]
> > > > >>
> > > > >> *(new job) HBase-Find-Flaky-Tests*: Gets test reports of recent
> > > > >> builds of the post-commit job (TRUNK_matrix) and the
> > > > >> HBase-Flaky-Tests job (see below) to find flaky tests. The frequency
> > > > >> of runs determines how fast we catch test regressions. So if we run
> > > > >> it every 4 hours, any test which started failing in the post-commit
> > > > >> job (TRUNK_matrix) in the last 4 hours will be blacklisted.
> > > > >>
> > > > >> *(new job) HBase-Flaky-Tests*: This job runs only the flaky tests.
> > > > >> The aim is to run this job back-to-back to collect as many runs as
> > > > >> we can. The higher the run rate, the better our system will be at
> > > > >> catching flaky tests. We currently run it hourly, so we’ll be able
> > > > >> to keep track of flaky tests with a ~5% failure rate or more.
> > > > >>
> > > > >> *Post-commit (TRUNK_matrix) and pre-commit jobs*: Exclude these
> > > > >> flaky tests.
> > > >>
> > > >>
> > > > >> *So what if a bad commit makes a good test bad?*
> > > > >> Since the test is not yet marked bad, it’ll run in the next
> > > > >> post-commit build and fail. The next run of HBase-Find-Flaky-Tests
> > > > >> will pick it up and blacklist it. Blacklisting will help keep the
> > > > >> post-commit job and, more importantly, the pre-commit job clean, a
> > > > >> problem we face quite often.
> > > >>
> > > > >> *Are we just tucking away our shit?*
> > > > >> Nope, this will help us:
> > > > >> - first, maintain a list of bad tests (we lack that today).
> > > > >> - second, make our build greener, to the point that a failed/red
> > > > >> build is something we worry about seriously.
> > > >>
> > > > >> Once we are confident that the system is working fine, we’ll set up
> > > > >> the HBase-Find-Flaky-Tests job to send reports to dev@hbase so that
> > > > >> devs know about the bad tests. If the list remains hidden somewhere
> > > > >> in a Jenkins job’s archive, it’s unlikely that we’ll actively work
> > > > >> on getting them fixed :).
> > > >>
> > > >> I'll keep posting further updates on this thread.
> > > >>
> > > >> -- Appy
> > > >>
> > > >
> > > >
> > >
> >
>



-- 

Regards

Apekshit Sharma | Software Engineer, Cloudera | Palo Alto, California |
650-963-6311
