zookeeper-dev mailing list archives

From Fangmin Lv <lvfang...@gmail.com>
Subject Re: Decrease number of threads in Jenkins builds to reduce flakyness
Date Sun, 14 Oct 2018 23:58:13 GMT
Internally, we also did some work to reduce the flakiness. Here are the main
things we've done:

* using a retry rule to retry in case the zk client loses its connection;
this can happen when the quorum tests run in an unstable environment and a
leader election happens (see the sketch right after this list)
* using random ports instead of sequential ones to avoid port races when
running tests concurrently
* changing tests to avoid using the same test path when creating/deleting
nodes (a path-naming sketch is further below)
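
To make the first point concrete, here is a minimal sketch of the retry-rule
idea as a JUnit 4 TestRule. The class name and retry count are illustrative
only, not necessarily the exact rule we run internally:

    import org.apache.zookeeper.KeeperException;
    import org.junit.rules.TestRule;
    import org.junit.runner.Description;
    import org.junit.runners.model.Statement;

    // Illustrative only: re-runs the test body a few times when it fails with
    // a connection-loss error, which can happen when a quorum test hits a
    // leader election on an overloaded machine.
    public class RetryOnConnectionLossRule implements TestRule {
        private final int maxAttempts;

        public RetryOnConnectionLossRule(int maxAttempts) {
            this.maxAttempts = maxAttempts;
        }

        @Override
        public Statement apply(final Statement base, final Description description) {
            return new Statement() {
                @Override
                public void evaluate() throws Throwable {
                    KeeperException.ConnectionLossException last = null;
                    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                        try {
                            base.evaluate();   // run the actual test body
                            return;            // passed, stop retrying
                        } catch (KeeperException.ConnectionLossException e) {
                            last = e;          // transient failure, try again
                        }
                    }
                    throw last;                // retries exhausted
                }
            };
        }
    }

A test class would then declare it with something like:

    @Rule
    public RetryOnConnectionLossRule retry = new RetryOnConnectionLossRule(3);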

These changes greatly reduced the flakiness internally; we should try them if
we're seeing similar issues in Jenkins.
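
For the third point, a minimal sketch of one way to keep test paths unique per
test method, using JUnit's TestName rule (the helper and naming scheme are
just illustrative):

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.TestName;

    public class UniquePathExampleTest {
        // JUnit's TestName rule exposes the current test method name.
        @Rule
        public TestName testName = new TestName();

        // Hypothetical helper: build a znode path that is unique per test
        // method (and per run), so concurrent tests never collide on the
        // same path when creating/deleting nodes.
        private String uniqueTestPath() {
            return "/" + getClass().getSimpleName()
                    + "-" + testName.getMethodName()
                    + "-" + System.nanoTime();
        }

        @Test
        public void createsItsOwnSubtree() throws Exception {
            String path = uniqueTestPath();
            // zk.create(path, ...) would go here in a real test.
            System.out.println("test works under " + path);
        }
    }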

Fangmin

On Sat, Oct 13, 2018 at 10:48 AM Bogdan Kanivets <bkanivets@gmail.com>
wrote:

> I looked into the flakiness a couple of months ago (with special attention
> to testManyChildWatchersAutoReset). In my opinion the problem is a) and c).
> Unfortunately I don't have data to back this claim.
>
> I don't remember seeing many 'port binding' exceptions, unless the 'port
> assignment' issue manifested as some other exception.
>
> Before decreasing the number of threads, I think more data should be
> collected/visualized:
>
> 1) The flaky dashboard is great, but we should add another report that maps
> 'error causes' to builds/tests
> 2) The flaky dash could be extended to save more history (for example, like
> this: https://www.chromium.org/developers/testing/flakiness-dashboard)
> 3) PreCommit builds should be included in the dashboard
> 4) We should have a common, clean benchmark. For example: take an
> AWS t3.xlarge instance with a fixed Linux distro, JVM, and zk commit sha, and
> run the tests (current 8 threads) for 8 hours with a 1 min cooldown.
>
> Due to a recent employment change, I got sidetracked, but I really want to
> get to the bottom of this.
> I'm going to set up 4) and report results to this mailing list. I'm also
> willing to work on other items.
>
>
> On Sat, Oct 13, 2018 at 4:59 AM Enrico Olivelli <eolivelli@gmail.com>
> wrote:
>
> > On Fri, Oct 12, 2018, 23:17 Benjamin Reed <breed@apache.org> wrote:
> >
> > > I think the unique port assignment (d) is more problematic than it
> > > appears. There is a race between finding a free port and actually
> > > grabbing it. I think that contributes to the flakiness.
> > >
> >
> > This is very hard to solve for our test cases, because we need to build
> > the configs before starting the groups of servers.
> > For single-server tests it would be easier: you just start the server on
> > port zero, get the actual port, and then create the client configs.
> > I don't know how much it would be worth.
> >
> > Enrico
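
(A minimal, hypothetical sketch of the port-zero idea above, using plain
java.net rather than ZooKeeper's own server classes: binding to port 0 lets
the OS pick a free port atomically, so there is no window between finding a
free port and grabbing it, and the test can read the bound port back to build
the client configs.)

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.ServerSocket;

    public class PortZeroSketch {
        public static void main(String[] args) throws IOException {
            // Bind to port 0: the OS assigns a free port atomically, so there
            // is no race between "finding" a free port and "grabbing" it.
            try (ServerSocket listener = new ServerSocket()) {
                listener.bind(new InetSocketAddress(0));
                // Read the actual port back; a test would use this to build
                // the client connect string / configs.
                int boundPort = listener.getLocalPort();
                System.out.println("server bound to port " + boundPort);
            }
        }
    }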
> >
> >
> > > ben
> > > On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar <andor@apache.org> wrote:
> > > >
> > > > That is a completely valid point. I started to investigate flakies for
> > > > exactly the same reason, if you remember the thread that I started a
> > > > while ago. It was unfortunately later abandoned, because I ran into a
> > > > few issues:
> > > >
> > > > - We nailed down that in order to release 3.5 stable, we have to make
> > > > sure it's not worse than 3.4 by comparing the builds: but these builds
> > > > are not comparable, because the 3.4 tests run single-threaded while 3.5
> > > > runs multithreaded, showing problems which might also exist in 3.4,
> > > >
> > > > - Neither of them is running the C++ tests for some reason, but that's
> > > > not really an issue here,
> > > >
> > > > - It looks like the tests on 3.5 are just as solid as on 3.4, because
> > > > running them in a dedicated, single-threaded environment shows almost
> > > > all tests succeeding,
> > > >
> > > > - I think the root cause of the failing unit tests could be one (or
> > > > more) of the following:
> > > >         a) Environmental: the Jenkins slave gets overloaded with other
> > > > builds and multithreaded test running makes things even worse: starved
> > > > JDK threads and ZK instances (both clients and servers) are unable to
> > > > operate
> > > >         b) Conceptual: ZK unit tests were not designed to run on
> > > > multiple threads: I investigated the unique port assignment feature,
> > > > which is looking good, but there could be other possible gaps which
> > > > make them unreliable when running simultaneously.
> > > >         c) Bad testing: testing ZK in the wrong way, making bad
> > > > assumptions (e.g. not syncing clients), etc.
> > > >         d) Bug in the server.
> > > >
> > > > I feel that finding case d) with these tests is super hard, because a
> > > > test report doesn't give any information on what could go wrong with
> > > > ZooKeeper. More or less, guessing is your only option.
> > > >
> > > > Finding c) is a little bit easier; I'm trying to submit patches for
> > > > them and hopefully making some progress.
> > > >
> > > > The huge pain in the arse though is a) and b): people desperately keep
> > > > commenting "please retest this" on GitHub to get a green build, while
> > > > testing is heading in a direction that hides real problems: I mean
> > > > people have started not to care about a failing build, because "it must
> > > > be some flaky test unrelated to my patch". Which is bad, but the shame
> > > > is that it's true in 90% of cases.
> > > >
> > > > I’m just trying to find some ways - besides fixing c) and d) flakies
> -
> > > to get more reliable and more informative Jenkins builds. Don’t want to
> > > make a huge turnaround, but I think if we can get a significantly more
> > > reliable build for the price of slightly longer build time running on 4
> > > threads instead of 8, I say let’s do it.
> > > >
> > > > As always, any help from the community is more than welcome and
> > > appreciated.
> > > >
> > > > Thanks,
> > > > Andor
> > > >
> > > >
> > > >
> > > >
> > > > > On 2018. Oct 12., at 16:52, Patrick Hunt <phunt@apache.org> wrote:
> > > > >
> > > > > IIRC the number of threads was increased to improve performance.
> > > > > Reducing is fine, but do we understand why it's failing? Perhaps it's
> > > > > finding real issues as a result of the artificial concurrency/load.
> > > > >
> > > > > Patrick
> > > > >
> > > > > On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar <andor@cloudera.com.invalid>
> > > > > wrote:
> > > > >
> > > > >> Thanks for the feedback.
> > > > >> I'm running a few tests now: branch-3.5 on 2 threads and trunk on 4
> > > > >> threads, to see what the impact on the build time is.
> > > > >>
> > > > >> The GitHub PR job is hard to configure, because its settings are
> > > > >> hard coded into a shell script in the codebase. I have to open a PR
> > > > >> for that.
> > > > >>
> > > > >> Andor
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar <
> > > > >> nkalmar@cloudera.com.invalid> wrote:
> > > > >>
> > > > >>> +1, running the tests locally with 1 thread always passes (well, I
> > > > >>> ran it about 5 times, but still).
> > > > >>> On the other hand, running it on 8 threads yields similarly flaky
> > > > >>> results to the Apache runs. (Although it is much faster, we
> > > > >>> sometimes have to run it 6-8-10 times to get a green run...)
> > > > >>>
> > > > >>> Norbert
> > > > >>>
> > > > >>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli <eolivelli@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> +1
> > > > >>>>
> > > > >>>> Enrico
> > > > >>>>
> > > > >>>> On Fri, Oct 12, 2018, 13:52 Andor Molnar <andor@apache.org> wrote:
> > > > >>>>
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> What do you think of changing the number of threads running unit
> > > > >>>>> tests in Jenkins from the current 8 to 4 or even 2?
> > > > >>>>>
> > > > >>>>> Running the unit tests inside the Cloudera environment on a
> > > > >>>>> single thread shows the builds are much more stable. That would
> > > > >>>>> probably be too slow, but maybe at least running fewer threads
> > > > >>>>> would improve the situation.
> > > > >>>>>
> > > > >>>>> It's getting very annoying that I cannot get a green build on
> > > > >>>>> GitHub with only a few retests.
> > > > >>>>>
> > > > >>>>> Regards,
> > > > >>>>> Andor
> > > > >>>>>
> > > > >>>> --
> > > > >>>>
> > > > >>>>
> > > > >>>> -- Enrico Olivelli
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> > --
> >
> >
> > -- Enrico Olivelli
> >
>
