hbase-dev mailing list archives

From Zach York <zyork.contribut...@gmail.com>
Subject Re: Flaky dashboard for current branch-2
Date Sat, 13 Jan 2018 02:02:05 GMT
Thanks for the explanation Appy!

bq. I think we can actually update the script to send a mail to dev@ when it
encounters these 100% failing tests. Wanna try? :)

That would be cool, it would shame people into fixing tests :) I can try to
take a look at that.
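
Off the top of my head, the mailing part might look something like this
minimal sketch (assuming the dashboard script already knows which tests
failed in 100% of the recent runs; the SMTP host and the input list here
are placeholders I made up, not the real job config):

import smtplib
from email.mime.text import MIMEText

def mail_dev(always_failing, smtp_host="localhost"):
    """Send a heads-up to dev@ listing tests that failed in every recent run.

    always_failing: iterable of fully qualified test class names.
    smtp_host is a placeholder; the real job would use whatever relay the
    ASF Jenkins nodes are allowed to use.
    """
    if not always_failing:
        return  # nothing to shame anyone about :)
    body = ("The following tests failed in 100% of the recent nightly runs "
            "and are likely legitimately broken rather than flaky:\n\n"
            + "\n".join(sorted(always_failing)))
    msg = MIMEText(body)
    msg["Subject"] = "[flaky-dashboard] Tests failing 100% of the time"
    msg["From"] = "jenkins@builds.apache.org"
    msg["To"] = "dev@hbase.apache.org"
    with smtplib.SMTP(smtp_host) as server:
        server.send_message(msg)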



On Fri, Jan 12, 2018 at 5:01 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> There is more than one reason.
>
> Sometimes QA reports that tests in a module failed.
> When artifact/patchprocess/patch-unit-hbase-server.txt was checked, there
> was more than one occurrence of the following:
>
> https://pastebin.com/WBewfj3Q
>
> It is hard to decipher what was behind the crash.
> Finding hanging tests is currently not automated.
>
> Also note the following at the beginning of the test run:
>
> https://pastebin.com/sK6ebk84
>
> FYI
>
> On Fri, Jan 12, 2018 at 4:35 PM, 张铎(Duo Zhang) <palomino219@gmail.com>
> wrote:
>
> > Why can a test that fails 100% of the time not be detected by the
> > pre-commit check?
> >
> > Ted Yu <yuzhihong@gmail.com> wrote on Sat, Jan 13, 2018 at 07:44:
> >
> > > As we get closer and closer to the beta release, it is important to have
> > > as few flaky tests as possible.
> > >
> > > bq. we can actually update the script to send a mail to dev@
> > >
> > > A post to the JIRA that caused the 100% failing test would be better.
> > > The committer would notice the post and take corresponding action.
> > >
> > > Cheers
> > >
> > > On Fri, Jan 12, 2018 at 3:35 PM, Apekshit Sharma <appy@cloudera.com>
> > > wrote:
> > >
> > > > >   Is Nightly now using a list of flakes?
> > > > Dashboard job was flaky yesterday, so didn't start using it. Looks
> > > > like it's working fine now. Let me exclude flakies from nightly job.
> > > >
> > > > > Just took a look at the dashboard. Does this capture only failed
> > > > > runs or all runs?
> > > > Sorry, the question isn't clear. Runs of what?
> > > > Here's an attempt to answer it as best I can: the dashboard looks at
> > > > the last X (x=6 now) runs of nightly branch-2 to collect failing,
> > > > hanging, and timed-out tests.
> > > >
> > > > > I see that the following tests have failed 100% of the time for the
> > > > > last 30 runs [1]. If this captures all runs, this isn't truly flaky,
> > > > > but rather a legitimate failure, right?
> > > > > Maybe this tool is used to see all test failures, but if not, I feel
> > > > > like we could/should remove a test from the flaky tests/excludes if
> > > > > it fails consistently so we can fix the root cause
> > > >
> > > > Has come up a lot of times before. Yes, you're right: 100% failure =
> > > > legitimate failure.
> > > > <rant>
> > > > We as a community suck at tracking nightly runs for failing tests and
> > > > fixing them, otherwise we wouldn't have ~40 bad tests, right!
> > > > In fact, we suck at fixing tests even when they're presented in a nice
> > > > clean list (this dashboard). We just don't prioritize tests in our work.
> > > > The general attitude is: tests are failing... meh... what's new, they've
> > > > been failing for years. Instead of: oh, one test failed, find the cause
> > > > and revert it!
> > > > So the real thing to change here is the attitude of the community
> > > > towards tests. I am +1 for anything that'll promote/support that change.
> > > > </rant>
> > > > I think we can actually update the script to send a mail to dev@ when
> > > > it encounters these 100% failing tests. Wanna try? :)
> > > >
> > > > -- Appy
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jan 12, 2018 at 11:29 AM, Zach York <zyork.contribution@gmail.com>
> > > > wrote:
> > > >
> > > > > Just took a look at the dashboard. Does this capture only failed
> > > > > runs or all runs?
> > > > >
> > > > > I see that the following tests have failed 100% of the time for the
> > > > > last 30 runs [1]. If this captures all runs, this isn't truly flaky,
> > > > > but rather a legitimate failure, right?
> > > > > Maybe this tool is used to see all test failures, but if not, I feel
> > > > > like we could/should remove a test from the flaky tests/excludes if
> > > > > it fails consistently so we can fix the root cause.
> > > > >
> > > > > [1]
> > > > > master.balancer.TestRegionsOnMasterOptions
> > > > > client.TestMultiParallel
> > > > > regionserver.TestRegionServerReadRequestMetrics
> > > > >
> > > > > Thanks,
> > > > > Zach
> > > > >
> > > > > On Fri, Jan 12, 2018 at 8:19 AM, Stack <stack@duboce.net> wrote:
> > > > >
> > > > > > Dashboard doesn't capture timed out tests, right Appy?
> > > > > > Thanks,
> > > > > > S
> > > > > >
> > > > > > On Thu, Jan 11, 2018 at 6:10 PM, Apekshit Sharma <appy@cloudera.com>
> > > > > > wrote:
> > > > > >
> > > > > > > https://builds.apache.org/job/HBase-Find-Flaky-Tests-branch2.0/lastSuccessfulBuild/artifact/dashboard.html
> > > > > > >
> > > > > > > @stack: when you branch out branch-2.0, let me know, I'll update
> > > > > > > the jobs to point to that branch so that it's helpful for the
> > > > > > > release. Once the release is done, I'll move them back to
> > > > > > > "branch-2".
> > > > > > >
> > > > > > >
> > > > > > > -- Appy
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > >
> > > > -- Appy
> > > >
> > >
> >
>
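
P.S. To check my understanding of the dashboard logic Appy describes above
(look at the last X nightly runs, collect failing/hanging/timed-out tests,
and anything at 100% is a legitimate breakage rather than a flake), the
aggregation would boil down to roughly this sketch; the per-run input
format is my own invention, not what the script actually consumes:

from collections import defaultdict

NUM_RUNS = 6  # the "last X (x=6 now) runs" Appy mentions

def aggregate(runs):
    """Tally per-test outcomes across nightly runs.

    runs: list of dicts mapping test name -> one of
    'passed', 'failed', 'hanging', 'timeout'.
    """
    stats = defaultdict(lambda: {"bad": 0, "total": 0})
    for run in runs[-NUM_RUNS:]:
        for test, outcome in run.items():
            stats[test]["total"] += 1
            if outcome != "passed":
                stats[test]["bad"] += 1
    return stats

def classify(stats):
    """Split tests into truly flaky (intermittent) vs 100% failing."""
    flaky, broken = [], []
    for test, s in stats.items():
        if s["bad"] == 0:
            continue
        (broken if s["bad"] == s["total"] else flaky).append(test)
    return flaky, broken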
