From: Enrico Olivelli
Date: Sat, 13 Oct 2018 13:59:10 +0200
Subject: Re: Decrease number of threads in Jenkins builds to reduce flakyness
To: dev@zookeeper.apache.org

On Fri, Oct 12, 2018 at 23:17, Benjamin Reed wrote:

> i think the unique port assignment (d) is more problematic than it
> appears. there is a race between finding a free port and actually
> grabbing it. i think that contributes to the flakiness.
>

This is very hard to solve for our test cases, because we need to build
the configs before starting the groups of servers.

For single-server tests it will be easier: you just have to start the
server on port zero, get the port, and then create the client configs.

I don't know how much it would be worth.

Enrico

> ben
> On Fri, Oct 12, 2018 at 8:50 AM Andor Molnar wrote:
> >
> > That is a completely valid point. I started to investigate flakies for
> > exactly the same reason, if you remember the thread that I started a
> > while ago. It was later abandoned unfortunately, because I’ve run into
> > a few issues:
> >
> > - We nailed down that in order to release 3.5 stable, we have to make
> > sure it’s not worse than 3.4 by comparing the builds: but these builds
> > are not comparable, because 3.4 tests run single-threaded while 3.5
> > tests run multithreaded, showing problems which might also exist on 3.4,
> >
> > - Neither of them is running C++ tests for some reason, but that’s not
> > really an issue here,
> >
> > - Looks like tests on 3.5 are just as solid as on 3.4, because running
> > them on a dedicated, single-threaded environment shows almost all tests
> > succeeding,
> >
> > - I think the root cause of failing unit tests could be one (or more)
> > of the following:
> > a) Environmental: the Jenkins slave gets overloaded with other
> > builds and multithreaded test running makes things even worse: JDK
> > threads starve and ZK instances (both clients and servers) are unable
> > to operate
> > b) Conceptual: ZK unit tests were not designed to run on
> > multiple threads: I investigated the unique port assignment feature,
> > which is looking good, but there could be other possible gaps
> > which make them unreliable when running simultaneously.
> > c) Bad testing: testing ZK in the wrong way, making bad
> > assumptions (e.g. not syncing clients), etc.
> > d) Bug in the server.
> >
> > I feel that finding case d) with these tests is super hard, because a
> > test report doesn’t give any information on what could go wrong with
> > ZooKeeper. More or less guessing is your only option.
> >
> > Finding c) is a little bit easier, I’m trying to submit patches on them
> > and hopefully making some progress.
> >
> > The huge pain in the arse though are a) and b): people desperately keep
> > commenting “please retest this” on github to get a green build while
> > testing is going in a direction to hide real problems: I mean people
> > started not to care about a failing build, because “it must be some
> > flaky unrelated to my patch”. Which is bad, but the shame is it’s true
> > in 90% of cases.
> >
> > I’m just trying to find some ways - besides fixing c) and d) flakies -
> > to get more reliable and more informative Jenkins builds. Don’t want to
> > make a huge turnaround, but I think if we can get a significantly more
> > reliable build for the price of slightly longer build time running on 4
> > threads instead of 8, I say let’s do it.
> >
> > As always, any help from the community is more than welcome and
> > appreciated.
> >
> > Thanks,
> > Andor
> >
> > > On 2018. Oct 12., at 16:52, Patrick Hunt wrote:
> > >
> > > iirc the number of threads was increased to improve performance.
> > > Reducing is fine, but do we understand why it's failing? Perhaps it's
> > > finding real issues as a result of the artificial concurrency/load.
> > >
> > > Patrick
> > >
> > > On Fri, Oct 12, 2018 at 7:12 AM Andor Molnar wrote:
> > >
> > >> Thanks for the feedback.
> > >> I'm running a few tests now: branch-3.5 on 2 threads and trunk on 4
> > >> threads to see what the impact on the build time is.
> > >>
> > >> The GitHub PR job is hard to configure, because its settings are
> > >> hard-coded into a shell script in the codebase. I have to open a PR
> > >> for that.
> > >>
> > >> Andor
> > >>
> > >> On Fri, Oct 12, 2018 at 2:46 PM, Norbert Kalmar
> > >> <nkalmar@cloudera.com.invalid> wrote:
> > >>
> > >>> +1, running the tests locally with 1 thread always passes (well, I
> > >>> ran it about 5 times, but still)
> > >>> On the other hand, running it on 8 threads yields similarly flaky
> > >>> results as the Apache runs. (Although it is much faster, but if we
> > >>> have to run 6-8-10 times sometimes to get a green run...)
> > >>>
> > >>> Norbert
> > >>>
> > >>> On Fri, Oct 12, 2018 at 2:05 PM Enrico Olivelli wrote:
> > >>>
> > >>>> +1
> > >>>>
> > >>>> Enrico
> > >>>>
> > >>>> On Fri, Oct 12, 2018 at 13:52, Andor Molnar wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> What do you think of changing the number of threads running unit
> > >>>>> tests in Jenkins from the current 8 to 4 or even 2?
> > >>>>>
> > >>>>> Running unit tests inside the Cloudera environment on a single
> > >>>>> thread shows the builds much more stable. That would probably be
> > >>>>> too slow, but maybe running at least fewer threads would improve
> > >>>>> the situation.
> > >>>>>
> > >>>>> It's getting very annoying that I cannot get a green build on
> > >>>>> GitHub with only a few retests.
> > >>>>>
> > >>>>> Regards,
> > >>>>> Andor
> > >>>>>
> > >>>> --
> > >>>>
> > >>>> -- Enrico Olivelli

--
-- Enrico Olivelli
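
A minimal sketch of the port-zero idea from the reply at the top of this
message, assuming a plain java.net.ServerSocket as a stand-in for a
ZooKeeper server. The names PortZeroSketch, findFreePortRacy, bindEphemeral
and connectString are hypothetical and are not part of ZooKeeper. It
contrasts the probe-then-release pattern, which has the race between
finding a free port and actually grabbing it, with binding to port 0 and
keeping the socket, after which a client config can be built from the
real port.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class PortZeroSketch {

    /**
     * Racy pattern: probe for a free port, release it, return the number.
     * Another process or test thread may take the port between the close()
     * here and the later bind by the real server.
     */
    public static int findFreePortRacy() throws IOException {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        } // the port is released here; nothing reserves it for the caller
    }

    /**
     * Race-free pattern: bind to port 0 and keep the socket open.
     * The OS picks the port and the caller already owns it.
     */
    public static ServerSocket bindEphemeral() throws IOException {
        ServerSocket server = new ServerSocket();
        server.bind(new InetSocketAddress(InetAddress.getByName("127.0.0.1"), 0));
        return server;
    }

    /** Hypothetical helper: build a client connect string from the bound socket. */
    public static String connectString(ServerSocket server) {
        return "127.0.0.1:" + server.getLocalPort();
    }

    public static void main(String[] args) throws IOException {
        // Start the "server" first, then derive the client config from it.
        try (ServerSocket server = bindEphemeral()) {
            System.out.println("client config: " + connectString(server));
        }
    }
}
```

This only covers the single-server case. For a group of servers the
configs have to list every peer's port before any of them starts, which is
the harder case described above.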