zookeeper-dev mailing list archives

From Bogdan Kanivets <bkaniv...@gmail.com>
Subject Re: Decrease number of threads in Jenkins builds to reduce flakyness
Date Mon, 15 Oct 2018 07:31:18 GMT
Fangmin,

Those are good ideas.

FYI, I've started running tests continuously on an AWS m1.xlarge instance:
https://github.com/lavacat/zookeeper-tests-lab

So far I've done ~12 runs of trunk. The common offenders are the same as on
the Flaky dashboard: testManyChildWatchersAutoReset and
testPurgeWhenLogRollingInProgress.
I'll do some more runs, then try to come up with a report.

I'm using AWS rather than the Apache Jenkins environment because it gives me
better control and observability.




On Sun, Oct 14, 2018 at 4:58 PM Fangmin Lv <lvfangmin@gmail.com> wrote:

> Internally, we also did some work to reduce flakiness. Here are the main
> things we've done:
>
> * using a retry rule to retry when the ZK client loses its connection,
> which can happen if the quorum tests are running in an unstable environment
> and a leader election takes place
> * using random ports instead of sequential ones to avoid port races when
> running tests concurrently
> * changing tests to avoid using the same test path when creating/deleting
> nodes
>
> These greatly reduced the flakiness internally; we should try them if we're
> seeing similar issues in Jenkins.
>
> Fangmin
>
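For reference, the retry idea in the first bullet could look roughly like the
sketch below. It is only an illustration in plain Java, not the rule used
internally by Fangmin's team: the Retry class, withRetries, and its parameters
are hypothetical names. In a JUnit 4 suite the same loop would normally live
inside a TestRule; here it is a standalone helper so that it runs without any
test framework.

```java
import java.util.concurrent.Callable;

// Hypothetical helper: re-runs an action when it fails with a retryable exception type.
final class Retry {
    private Retry() {}

    static <T> T withRetries(int maxAttempts,
                             Class<? extends Exception> retryable,
                             Callable<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Retry only the expected transient failure, and only while attempts remain.
                if (retryable.isInstance(e) && attempt < maxAttempts) {
                    continue;
                }
                // Anything else (or the final attempt) is reported to the caller.
                if (e instanceof RuntimeException) {
                    throw (RuntimeException) e;
                }
                throw new IllegalStateException(
                        "gave up after " + attempt + " attempt(s)", e);
            }
        }
    }
}
```

A test body would be wrapped as Retry.withRetries(3, SomeException.class, ...).
In ZooKeeper tests the retryable type would presumably be the client's
connection-loss exception and nothing broader, so that real assertion failures
still fail on the first attempt.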
> On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets <bkanivets@gmail.com>
> wrote:
>
> > I looked into flakiness a couple of months ago (with special attention to
> > testManyChildWatchersAutoReset). In my opinion the problem is a) and c).
> > Unfortunately I don't have data to back this claim.
> >
> > I don't remember seeing many 'port binding' exceptions, unless the 'port
> > assignment' issue manifested as some other exception.
> >
> > Before decreasing the number of threads, I think more data should be
> > collected/visualized:
> >
> > 1) The Flaky dashboard is great, but we should add another report that
> > maps 'error causes' to builds/tests
> > 2) The Flaky dashboard can be extended to save more history (for example
> > like this
> > https://www.chromium.org/developers/testing/flakiness-dashboard)
> > 3) PreCommit builds should be included in the dashboard
> > 4) We should have a common clean benchmark. For example, take an AWS
> > t3.xlarge instance with a fixed Linux distro, JVM, and ZK commit SHA, and
> > run the tests (currently 8 threads) for 8 hours with a 1 min cooldown.
> >
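The report in 1) could be as simple as a two-level count: error cause, then
test, then number of failures. A minimal sketch, assuming the failures have
already been extracted from build logs into (build, test, cause) triples; the
class and field names are hypothetical and nothing here reflects how the
existing Flaky dashboard is implemented.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical report: groups test failures by error cause, then by test name.
final class FailureReport {
    private FailureReport() {}

    // One failed test in one build, with the cause assumed to be parsed from the log.
    static final class Failure {
        final String build;
        final String test;
        final String cause;

        Failure(String build, String test, String cause) {
            this.build = build;
            this.test = test;
            this.cause = cause;
        }
    }

    // Returns cause -> (test -> number of failures); TreeMap keeps the output sorted.
    static Map<String, Map<String, Integer>> countByCause(List<Failure> failures) {
        Map<String, Map<String, Integer>> report = new TreeMap<>();
        for (Failure f : failures) {
            report.computeIfAbsent(f.cause, c -> new TreeMap<>())
                  .merge(f.test, 1, Integer::sum);
        }
        return report;
    }
}
```

Sorting by cause first makes it easy to see whether one cause (for example a
connection loss) dominates across many tests, or whether a single test fails
for many different reasons.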
> > Due to a recent employment change I got sidetracked, but I really want to
> > get to the bottom of this.
> > I'm going to set up 4) and report the results to this mailing list. I'm
> > also willing to work on other items.
> >
> >
> >
> >
> >
> >
> > On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli <eolivelli@gmail.com>
> > wrote:
> >
> > > On Fri, Oct 12, 2018, 23:17 Benjamin Reed <breed@apache.org> wrote:
> > >
> > > > i think the unique port assignment (d) is more problematic than it
> > > > appears. there is a race between finding a free port and actually
> > > > grabbing it. i think that contributes to the flakiness.
> > > >
> > >
> > > This is very hard to solve for our test cases, because we need to build
> > > configs before starting the groups of servers.
> > > For single-server tests it will be easier: you just have to start the
> > > server on port zero, get the port, and then create the client configs.
> > > I don't know how much it would be worth.
> > >
> > > Enrico
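The single-server case described above can be sketched with a plain socket.
This is not ZooKeeper's test code: EphemeralPort and its methods are
hypothetical names, and a bare ServerSocket stands in for the server. Binding
to port 0 asks the OS to pick a free port and hold it, so there is no gap
between finding a port and grabbing it; the client config is built afterwards
from the port that was actually bound.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

// Hypothetical helper: bind first, then derive the client config from the bound port.
final class EphemeralPort {
    private EphemeralPort() {}

    // Port 0 lets the OS choose a free port; the returned socket already owns it.
    static ServerSocket bindAnyFreePort() {
        try {
            return new ServerSocket(0, 50, InetAddress.getByName("127.0.0.1"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a host:port string for clients only after the server side is bound.
    static String clientConnectString(ServerSocket server) {
        return "127.0.0.1:" + server.getLocalPort();
    }
}
```

This only helps when the port can be learned after the server starts. For
quorum tests, where each server's config must list its peers' ports before any
of them start, the race remains, as noted above.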
> > >
> > >
> > > > ben
> > > > On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <andor@apache.org>
> > > > wrote:
> > > > >
> > > > > That is a completely valid point. I started to investigate flakies
> > > > > for exactly the same reason, if you remember the thread that I
> > > > > started a while ago. Unfortunately it was later abandoned, because
> > > > > I ran into a few issues:
> > > > >
> > > > > - We nailed down that in order to release 3.5 stable, we have to
> > > > > make sure it’s not worse than 3.4 by comparing the builds. But these
> > > > > builds are not comparable, because 3.4 tests run single-threaded
> > > > > while 3.5 tests run multithreaded, showing problems which might also
> > > > > exist on 3.4,
> > > > >
> > > > > - Neither of them runs C++ tests for some reason, but that’s not
> > > > > really an issue here,
> > > > >
> > > > > - Looks like tests on 3.5 are just as solid as on 3.4, because
> > > > > running them in a dedicated, single-threaded environment shows
> > > > > almost all tests succeeding,
> > > > >
> > > > > - I think the root cause of failing unit tests could be one (or
> > > > > more) of the following:
> > > > >         a) Environmental: the Jenkins slave gets overloaded with other
> > > > >         builds, and multithreaded test running makes things even
> > > > >         worse: JDK threads starve and ZK instances (both clients and
> > > > >         servers) are unable to operate.
> > > > >         b) Conceptual: ZK unit tests were not designed to run on
> > > > >         multiple threads. I investigated the unique port assignment
> > > > >         feature, which is looking good, but there could be other gaps
> > > > >         which make them unreliable when running simultaneously.
> > > > >         c) Bad testing: testing ZK in the wrong way, making bad
> > > > >         assumptions (e.g. not syncing clients), etc.
> > > > >         d) Bug in the server.
> > > > >
> > > > > I feel that finding case d) with these tests is super hard, because
> > > > > a test report doesn’t give any information on what could go wrong
> > > > > with ZooKeeper. More or less guessing is your only option.
> > > > >
> > > > > Finding c) is a little bit easier. I’m trying to submit patches for
> > > > > them and am hopefully making some progress.
> > > > >
> > > > > The huge pain in the arse, though, is a) and b): people desperately
> > > > > keep commenting “please retest this” on GitHub to get a green build,
> > > > > while testing is going in a direction that hides real problems. I
> > > > > mean people have stopped caring about a failing build, because “it
> > > > > must be some flaky unrelated to my patch”. Which is bad, but the
> > > > > shame is that it’s true in 90% of cases.
> > > > >
> > > > > I’m just trying to find some ways - besides fixing c) and d)
> > > > > flakies - to get more reliable and more informative Jenkins builds.
> > > > > I don’t want to make a huge turnaround, but if we can get a
> > > > > significantly more reliable build for the price of a slightly longer
> > > > > build time by running on 4 threads instead of 8, I say let’s do it.
> > > > >
> > > > > As always, any help from the community is more than welcome and
> > > > > appreciated.
> > > > >
> > > > > Thanks,
> > > > > Andor
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > > On 2018. Oct 12., at 16:52, Patrick Hunt <phunt@apache.org> wrote:
> > > > > >
> > > > > > iirc the number of threads was increased to improve performance.
> > > > > > Reducing is fine, but do we understand why it's failing? Perhaps
> > > > > > it's finding real issues as a result of the artificial
> > > > > > concurrency/load.
> > > > > >
> > > > > > Patrick
> > > > > >
> > > > > > On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar
> > > > > > <andor@cloudera.com.invalid> wrote:
> > > > > >
> > > > > >> Thanks for the feedback.
> > > > > >> I'm running a few tests now: branch-3.5 on 2 threads and trunk on
> > > > > >> 4 threads, to see what the impact on the build time is.
> > > > > >>
> > > > > >> The GitHub PR job is hard to configure, because its settings are
> > > > > >> hard-coded into a shell script in the codebase. I have to open a PR
> > > > > >> for that.
> > > > > >>
> > > > > >> Andor
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <
> > > > > >> nkalmar@cloudera.com.invalid> wrote:
> > > > > >>
> > > > > >>> +1, running the tests locally with 1 thread always passes (well, I
> > > > > >>> ran it about 5 times, but still).
> > > > > >>> On the other hand, running it on 8 threads yields similarly flaky
> > > > > >>> results to the Apache runs. (It is much faster, but not if we
> > > > > >>> sometimes have to run it 6-8-10 times to get a green run...)
> > > > > >>>
> > > > > >>> Norbert
> > > > > >>>
> > > > > >>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli
> > > > > >>> <eolivelli@gmail.com> wrote:
> > > > > >>>
> > > > > >>>> +1
> > > > > >>>>
> > > > > >>>> Enrico
> > > > > >>>>
> > > > > >>>> On Fri, Oct 12, 2018, 13:52 Andor Molnar <andor@apache.org> wrote:
> > > > > >>>>
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> What do you think of changing the number of threads running unit
> > > > > >>>>> tests in Jenkins from the current 8 to 4 or even 2?
> > > > > >>>>>
> > > > > >>>>> Running unit tests inside the Cloudera environment on a single
> > > > > >>>>> thread shows the builds to be much more stable. That would
> > > > > >>>>> probably be too slow, but maybe running with at least fewer
> > > > > >>>>> threads would improve the situation.
> > > > > >>>>>
> > > > > >>>>> It's getting very annoying that I cannot get a green build on
> > > > > >>>>> GitHub with only a few retests.
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> Andor
> > > > > >>>>>
> > > > > >>>> --
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> -- Enrico Olivelli
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > > --
> > >
> > >
> > > -- Enrico Olivelli
> > >
> >
>
