zookeeper-dev mailing list archives

From Norbert Kalmar <nkal...@cloudera.com.INVALID>
Subject Re: Glide path to getting 3.5.x out of beta
Date Thu, 29 Nov 2018 09:16:17 GMT
+1 (non-binding) on the "dockerisation" of the tests!

I agree, though, that the change first needs to be voted on by the community,
and of course agreed on by committers / PMC members.

Let's see the reactions on this thread. If not enough votes come in,
you could try starting a new thread, prefixing the subject with [VOTE] or
[SUGGESTION] to draw attention.
Just write down your findings as in your last email, Michael.

Thank you for your help and work, sounds great!

Regards,
Norbert

On Thu, Nov 29, 2018 at 8:56 AM Enrico Olivelli <eolivelli@gmail.com> wrote:

> Great work Michael,
>
> I am totally +1 on using docker for network isolation
>
> I think that Apache CI may allow out-of-the-box execution in Docker
> containers; in fact, we have the "CloudBees Docker Custom Build
> Environment Plugin".
> We can use a public image or provide a Dockerfile in the repo.
>
> In my company we take another approach: we launch a script which
> sets up the container(s) and then runs the tests.
>
> The former approach (built into Jenkins) is easy to try. The latter is
> more complex, though maybe you already have some script. Our automatic
> QA script is quite complex and needs a lot of third-party tools.
>
> Once we are on Maven, we will not need an external
> findbugs, forrest, ant, ...
>
> ${ANT_HOME}/bin/ant \
>         -Dpatch.file=foobar \
>         -Dscratch.dir=$PATCH_DIR \
>         -Dps.cmd=/bin/ps \
>         -Dwget.cmd=/usr/bin/wget \
>         -Djiracli.cmd=/home/jenkins/tools/jiracli/latest/jira.sh \
>         -Dgit.cmd=/usr/bin/git \
>         -Dgrep.cmd=/bin/grep \
>         -Dpatch.cmd=/usr/bin/patch \
>         -Dfindbugs.home=/home/jenkins/tools/findbugs/latest/ \
>         -Dforrest.home=/home/jenkins/tools/forrest/latest/ \
>         -Djira.passwd=no-shown-here \
>         -Djava5.home=/home/jenkins/tools/java5/latest/ \
>         -Dcurl.cmd=/usr/bin/curl \
>         -Dtest.junit.maxmem=2g \
>         qa-test-pullrequest
>
> I am not a committer, but I have write access to Apache CI, so if the
> ZooKeeper PMC members agree on trying the Docker config on the CI jobs,
> I will be happy to try.
>
> Enrico
>
> On Thu, Nov 29, 2018 at 06:27 Michael K. Edwards
> <m.k.edwards@gmail.com> wrote:
> >
> > With the use of a Docker container (to prevent port collisions) and a
> > stack of cleanups to test code, I've made some progress towards
> > reliable test runs in our environment.
> > (https://github.com/mkedwards/zookeeper/commits/rollup-3.5, if you're
> > curious.)  The list below consists of the "top 40" slowest individual
> > tests.  Note that several appear multiple times, because of the
> > inclusion of classes containing slow tests in NioNettySuiteTest and
> > NettyNettySuiteTest.
> >
> > I'm somewhat hesitant to undertake further overhauls of the test
> > suite, because I've already found myself having to make the kinds of
> > changes that tend to be uphill battles, code-review-wise -- especially
> > coming from an outsider.
> > https://github.com/mkedwards/zookeeper/commit/e02eb705c6550f51ebb860a474ce711ec68c7a24
> > is an example.  If a Zookeeper committer is interested in working with
> > me on this, maybe email me?  Otherwise, I'll try to keep this branch
> > rebased regularly, and hammer on the remaining flaky tests to see what
> > I can learn.
> >
> > $ grep ' Ran ' build.log | sort -n -t '[' -r -k 4 | head -40
> >     [junit]  [39343@1f20a9d731ad] Ran [68.657] org.apache.zookeeper.test.ReconfigTest:testPortChangeToBlockedPortLeader ... OK
> >     [junit]  [44264@1f20a9d731ad] Ran [68.607] org.apache.zookeeper.test.ReconfigTest:testPortChangeToBlockedPortFollower ... OK
> >     [junit]  [59151@1f20a9d731ad] Ran [67.535] org.apache.zookeeper.test.ReconfigTest:testPortChangeToBlockedPortLeader ... OK
> >     [junit]  [59151@1f20a9d731ad] Ran [67.397] org.apache.zookeeper.test.ReconfigTest:testPortChangeToBlockedPortFollower ... OK
> >     [junit]  [24817@1f20a9d731ad] Ran [66.555] org.apache.zookeeper.server.quorum.ReconfigRecoveryTest:testNextConfigUnreachable ... OK
> >     [junit]  [39343@1f20a9d731ad] Ran [66.345] org.apache.zookeeper.test.ReconfigTest:testPortChangeToBlockedPortFollower ... OK
> >     [junit]  [44264@1f20a9d731ad] Ran [65.382] org.apache.zookeeper.test.ReconfigTest:testPortChangeToBlockedPortLeader ... OK
> >     [junit]  [33311@1f20a9d731ad] Ran [64.39] org.apache.zookeeper.test.DisconnectedWatcherTest:testManyChildWatchersAutoReset ... OK
> >     [junit]  [40332@1f20a9d731ad] Ran [60.907] org.apache.zookeeper.test.AsyncHammerTest:testHammer ... OK
> >     [junit]  [26094@1f20a9d731ad] Ran [58.559] org.apache.zookeeper.server.quorum.StandaloneDisabledTest:startSingleServerTest ... OK
> >     [junit]  [34470@1f20a9d731ad] Ran [52.229] org.apache.zookeeper.test.FollowerResyncConcurrencyTest:testResyncByTxnlogThenDiffAfterFollowerCrashes ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [49.956] org.apache.zookeeper.server.quorum.QuorumSSLTest:testHostnameVerificationWithInvalidIPAddress ... OK
> >     [junit]  [18406@1f20a9d731ad] Ran [48.65] org.apache.zookeeper.server.quorum.QuorumPeerMainTest:testFailedTxnAsPartOfQuorumLoss ... OK
> >     [junit]  [34470@1f20a9d731ad] Ran [48.582] org.apache.zookeeper.test.FollowerResyncConcurrencyTest:testResyncBySnapThenDiffAfterFollowerCrashes ... OK
> >     [junit]  [18406@1f20a9d731ad] Ran [47.115] org.apache.zookeeper.server.quorum.QuorumPeerMainTest:testEarlyLeaderAbandonment ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [46.184] org.apache.zookeeper.server.quorum.QuorumSSLTest:testCipherSuites ... OK
> >     [junit]  [28926@1f20a9d731ad] Ran [45.764] org.apache.zookeeper.test.AsyncHammerTest:testHammer ... OK
> >     [junit]  [39059@1f20a9d731ad] Ran [44.87] org.apache.zookeeper.test.AsyncHammerTest:testHammer ... OK
> >     [junit]  [39343@1f20a9d731ad] Ran [44.588] org.apache.zookeeper.test.ReconfigTest:testPortChange ... OK
> >     [junit]  [50794@1f20a9d731ad] Ran [43.504] org.apache.zookeeper.test.QuorumZxidSyncTest:testBehindLeader ... OK
> >     [junit]  [44264@1f20a9d731ad] Ran [43.183] org.apache.zookeeper.test.ReconfigTest:testPortChange ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [42.234] org.apache.zookeeper.server.quorum.QuorumSSLTest:testHostnameVerificationWithInvalidHostname ... OK
> >     [junit]  [59151@1f20a9d731ad] Ran [41.083] org.apache.zookeeper.test.ReconfigTest:testPortChange ... OK
> >     [junit]  [48135@1f20a9d731ad] Ran [40.15] org.apache.zookeeper.test.QuorumHammerTest:testHammerBasic ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [39.466] org.apache.zookeeper.server.quorum.QuorumSSLTest:testCertificateRevocationList ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [39.114] org.apache.zookeeper.server.quorum.QuorumSSLTest:testQuorumSSL ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [38.894] org.apache.zookeeper.server.quorum.QuorumSSLTest:testOCSP ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [37.875] org.apache.zookeeper.server.quorum.QuorumSSLTest:testHostnameVerificationWithInvalidIpAddressAndInvalidHostname ... OK
> >     [junit]  [19354@1f20a9d731ad] Ran [37.469] org.apache.zookeeper.server.quorum.QuorumSSLTest:testProtocolVersion ... OK
> >     [junit]  [24817@1f20a9d731ad] Ran [36.251] org.apache.zookeeper.server.quorum.ReconfigRecoveryTest:testCurrentObserverIsParticipantInNewConfig ... OK
> >     [junit]  [28926@1f20a9d731ad] Ran [34.052] org.apache.zookeeper.test.AsyncHammerTest:testObserversHammer ... OK
> >     [junit]  [39059@1f20a9d731ad] Ran [32.643] org.apache.zookeeper.test.AsyncHammerTest:testObserversHammer ... OK
> >     [junit]  [50794@1f20a9d731ad] Ran [32.521] org.apache.zookeeper.test.QuorumZxidSyncTest:testLateLogs ... OK
> >     [junit]  [18406@1f20a9d731ad] Ran [30.15] org.apache.zookeeper.server.quorum.QuorumPeerMainTest:testBadPeerAddressInQuorum ... OK
> >     [junit]  [26094@1f20a9d731ad] Ran [30.067] org.apache.zookeeper.server.quorum.StandaloneDisabledTest:startObserver ... OK
> >     [junit]  [18406@1f20a9d731ad] Ran [29.979] org.apache.zookeeper.server.quorum.QuorumPeerMainTest:testQuorumPeerExitTime ... OK
> >     [junit]  [22552@1f20a9d731ad] Ran [29.419] org.apache.zookeeper.server.quorum.ReconfigFailureCasesTest:testObserverToParticipantConversionFails ... OK
> >     [junit]  [9698@1f20a9d731ad] Ran [28.838] org.apache.zookeeper.server.ZxidRolloverTest:testRolloverThenRestart ... OK
> >     [junit]  [18406@1f20a9d731ad] Ran [28.075] org.apache.zookeeper.server.quorum.QuorumPeerMainTest:testInconsistentDueToNewLeaderOrder ... OK
> >     [junit]  [44264@1f20a9d731ad] Ran [27.39] org.apache.zookeeper.test.ReconfigTest:testRemoveAddTwo ... OK
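
[Editor's sketch] The grep/sort pipeline in Michael's message ranks tests by the bracketed duration. The same ranking could be done in Java roughly as follows, assuming each "Ran [seconds] Class:method" entry sits on a single line, as it would in the unwrapped build.log. The class SlowTestRanking and its members are hypothetical names used only for this illustration; they are not part of the ZooKeeper build.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the ZooKeeper build.
class SlowTestRanking {
    // Captures the seconds and the Class:method name from "Ran [68.657] Class:method ... OK".
    private static final Pattern RAN =
            Pattern.compile("Ran \\[(\\d+(?:\\.\\d+)?)\\]\\s+(\\S+)");

    static final class Timing {
        final double seconds;
        final String test;

        Timing(double seconds, String test) {
            this.seconds = seconds;
            this.test = test;
        }
    }

    // Returns at most `limit` timings, slowest first; lines without a "Ran [...]" part are skipped.
    static List<Timing> slowest(List<String> logLines, int limit) {
        List<Timing> timings = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = RAN.matcher(line);
            if (m.find()) {
                timings.add(new Timing(Double.parseDouble(m.group(1)), m.group(2)));
            }
        }
        timings.sort(Comparator.comparingDouble((Timing t) -> t.seconds).reversed());
        return new ArrayList<>(timings.subList(0, Math.min(limit, timings.size())));
    }

    public static void main(String[] args) {
        // Two entries copied from the list above, plus one line that should be ignored.
        List<String> sample = Arrays.asList(
                "    [junit]  [44264@1f20a9d731ad] Ran [27.39] org.apache.zookeeper.test.ReconfigTest:testRemoveAddTwo ... OK",
                "    [junit] some unrelated output",
                "    [junit]  [40332@1f20a9d731ad] Ran [60.907] org.apache.zookeeper.test.AsyncHammerTest:testHammer ... OK");
        for (Timing t : slowest(sample, 40)) {
            System.out.println(t.seconds + " " + t.test);
        }
    }
}
```

Parsing the duration as a number, rather than sorting on a text field, also makes it straightforward to aggregate per class later if that turns out to be useful.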
> > On Fri, Nov 23, 2018 at 7:41 AM Michael K. Edwards
> > <m.k.edwards@gmail.com> wrote:
> > >
> > > Thanks!  I assigned 2778 to myself.
> > >
> > > ZOOKEEPER-2778:  A port to the master branch of the current state of
> > > my patch is in https://github.com/apache/zookeeper/pull/719.  Be aware
> > > that there are a couple of touches to the code needed in 3.5 that
> > > aren't needed in master:
> > > https://github.com/apache/zookeeper/pull/707/files#diff-7a209d890686bcba351d758b64b22a7dR413
> > > and
> > > https://github.com/apache/zookeeper/pull/707/files#diff-b2dd09c58f745da275fee3c6d8681503R974
> > > (both of these are obviated by cleanups that have taken place on
> > > master).
> > >
> > > ZOOKEEPER-1636:  By "clean" I just mean "in isolation"; previously I
> > > had stacked this patch in a branch on top of the 2778 work.
> > >
> > > ZOOKEEPER-1818:  PR #714 is a port of Fangmin's patch to 3.5 (which
> > > split off before the refactor from termCondition to getVoteTracker).
> > > PR #718 is Fangmin's patch unchanged, just cherry-picked onto current
> > > master and poked until we got a green Jenkins build.
> > >
> > > "Address already in use":
> > > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/consoleText
> > > (search for BindException).  You generally have to look at the raw
> > > consoleText in order to find these.  I don't see any way of getting at
> > > the untruncated text for
> > > https://builds.apache.org/job/ZooKeeper_branch35_jdk8/1195/testReport/junit/org.apache.zookeeper.server.quorum/StandaloneDisabledTest/startSingleServerTest/
> > > , but I suspect there's a similar BindException hidden inside
> > > "...[truncated 395348 chars]..."
> > >
> > > On Fri, Nov 23, 2018 at 1:50 AM Andor Molnar <andor@apache.org> wrote:
> > > >
> > > > Hi Michael,
> > > >
> > > > I added you to the contributors list in Jira, so now you can assign
> > > > tickets to yourself.
> > > >
> > > > 3.5
> > > > ~~~
> > > > ZOOKEEPER-2778 - I already accepted the patch, but I'd like to kindly
> > > > ask you to create a separate pull request for the master branch, which
> > > > I can backport to 3.5 after committing it. This will help us follow
> > > > the standard procedure for making changes.
> > > >
> > > > ZOOKEEPER-1636 - Thanks for picking it up, I'll review your patch
> > > > shortly. By the way, I'm not sure what you mean by a "clean" pull
> > > > request.
> > > >
> > > > ZOOKEEPER-1818 - Fangmin is already taking care of this issue (PR
> > > > #703). Why have you created a new PR?
> > > >
> > > > Flakies
> > > > ~~~~~~~
> > > > We're already aware of the downside of the PortAssignment class, but
> > > > haven't really seen too many "Address already in use" problems in
> > > > tests (except in Java 11 builds, but those are unrelated). Would you
> > > > please provide some evidence for your findings, with links to the
> > > > builds you're talking about and the specific error messages?
> > > >
> > > > Thanks,
> > > > Andor
> > > >
> > > >
> > > >
> > > >
> > > > > On 2018. Nov 22., at 23:20, Michael K. Edwards <m.k.edwards@gmail.com> wrote:
> > > > > >
> > > > > > For what it's worth, builds 2732 and 2733 ran concurrently on H19,
> > > > > > and both failed for what I think are resource-conflict reasons.  It
> > > > > > would probably help to modify the
> > > > > > PreCommit-ZOOKEEPER-github-pr-build queue so that it doesn't
> > > > > > attempt concurrent builds on the same (uncontainerized) host.
> > > > > > On Thu, Nov 22, 2018 at 1:44 PM Michael K. Edwards
> > > > > > <m.k.edwards@gmail.com> wrote:
> > > > > >>
> > > > > >> Thanks for the guidance.  Feel free to assign ZOOKEEPER-2778 to me
> > > > > >> (I don't seem to be able to do it myself).  I've updated that pull
> > > > > >> request against 3.5 to address all reviewer comments.  When it
> > > > > >> looks ready to land, I'll port it to master as well.
> > > > > >>
> > > > > >> I have updated ZOOKEEPER-1636 and ZOOKEEPER-1818 with clean pull
> > > > > >> requests based on Thawan's and Fangmin's patches.  I'll poke at
> > > > > >> them until they build green, and try to handle anything reviewers
> > > > > >> bring up.
> > > > > >>
> > > > > >> With regard to flaky tests:  a fair fraction of spurious test
> > > > > >> failures appear to result from failure to bind a
> > > > > >> dynamically-assigned client/election/quorum port.  The prevailing
> > > > > >> hypothesis is that something else, running concurrently on the
> > > > > >> machine, is binding the port in between the check in
> > > > > >> PortAssignment (which binds it, to verify that it's not otherwise
> > > > > >> in use, and then closes that socket to free it again) and the
> > > > > >> subsequent use as a service port.  If that's the case, then we
> > > > > >> could eliminate this class of test failures by running the tests
> > > > > >> inside a container (with a dedicated network namespace).  Any
> > > > > >> failures of this kind that persist in a containerized test setup
> > > > > >> are the test fighting itself, not fighting unrelated concurrent
> > > > > >> processes.
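
[Editor's sketch] The check-then-use window described in the message above can be illustrated in plain Java. This is a minimal sketch under stated assumptions, not the actual org.apache.zookeeper.PortAssignment code: the class EphemeralPortSketch and both method names are hypothetical, and the probe binds port 0 (any free port) rather than walking a candidate range. It contrasts probing a port and releasing it with binding a socket and keeping it open, which is one possible way to avoid the window and is not something proposed in the thread.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration; not the actual PortAssignment implementation.
class EphemeralPortSketch {

    // Check-then-use: bind to learn that a port is free, then release it.
    // Between close() and the caller's later bind, any other process on the
    // same network namespace may take the port.
    static int probeAndRelease() throws IOException {
        try (ServerSocket probe = new ServerSocket()) {
            probe.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            return probe.getLocalPort();
        }
    }

    // Alternative: let the kernel pick a free port and hand back the bound
    // socket, so the port stays owned until the caller closes it.
    static ServerSocket bindAndHold() throws IOException {
        ServerSocket held = new ServerSocket();
        held.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        return held;
    }

    public static void main(String[] args) throws IOException {
        int probed = probeAndRelease();
        System.out.println("probed port (may be taken before it is reused): " + probed);
        try (ServerSocket held = bindAndHold()) {
            System.out.println("held port (owned until closed): " + held.getLocalPort());
        }
    }
}
```

Holding the socket only helps when the code under test can accept an already-bound socket; where it needs a bare port number, isolating the network namespace as suggested above addresses interference from unrelated processes instead.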
> > > > > >> On Thu, Nov 22, 2018 at 8:23 AM Andor Molnar <andor@cloudera.com> wrote:
> > > > > >>>
> > > > > >>> Hi Michael!
> > > > > >>>
> > > > > >>> Thanks for the great help in getting 3.5 out the door. We're
> > > > > >>> getting closer with each commit.
> > > > > >>>
> > > > > >>> You asked a lot of questions in your email, which I'm trying to
> > > > > >>> answer, but I believe the best approach is to deal with one
> > > > > >>> problem at a time. In email especially, it is not ideal to mix
> > > > > >>> different topics, because it makes things hard to follow.
> > > > > >>>
> > > > > >>> In this thread I'm focusing on the 3.5 release, in line with the
> > > > > >>> subject. There's another thread that I update every so often, but
> > > > > >>> your list is pretty much accurate too. I use the following query
> > > > > >>> for 3.5 blockers:
> > > > > >>>
> > > > > >>> project = ZooKeeper AND resolution = Unresolved AND fixVersion = 3.5.5 AND priority in (blocker, critical) ORDER BY priority DESC, key ASC
> > > > > >>>
> > > > > >>> ZOOKEEPER-1818 - Fangmin is working on it and a patch is available
> > > > > >>> on GitHub.
> > > > > >>> ZOOKEEPER-2778 - You're working on it, and a patch is available.
> > > > > >>> You should assign the Jira to yourself to avoid somebody else
> > > > > >>> picking it up.
> > > > > >>> ZOOKEEPER-1636 - An ancient C issue which has a patch available in
> > > > > >>> Jira. I'm planning to rebase it on master, but haven't had a
> > > > > >>> chance yet.
> > > > > >>>
> > > > > >>> All of the others are Maven/doc related, which Tamas and Norbert
> > > > > >>> are working on.
> > > > > >>>
> > > > > >>> Flaky tests are related, but we don't treat them as a blocker
> > > > > >>> issue. Here's the umbrella Jira that I've created to track the
> > > > > >>> progress:
> > > > > >>> https://issues.apache.org/jira/browse/ZOOKEEPER-3170
> > > > > >>>
> > > > > >>> Feel free to pick up any of the open ones, or create new ones if
> > > > > >>> you think it's necessary. It's generally better to open an
> > > > > >>> individual Jira for every issue you're working on and discuss the
> > > > > >>> details in it. You can open an email thread too, if you find it
> > > > > >>> convenient, but Jira is preferred.
> > > > > >>>
> > > > > >>> The preferred workflow is: Open Jira -> GitHub PR -> Commit to
> > > > > >>> master -> Backport to 3.5/3.4 if necessary -> Close Jira.
> > > > >>>
> > > > >>> Thank you for your contribution again!
> > > > >>>
> > > > >>> Andor
> > > > >>>
> > > > >>>
> > > > >>>
> > > > > >>> On Thu, Nov 22, 2018 at 12:51 PM Michael K. Edwards <m.k.edwards@gmail.com> wrote:
> > > > > >>>>
> > > > > >>>> I think it's mostly a problem in CI, where other processes on the
> > > > > >>>> same machine may compete for the port range, producing spurious
> > > > > >>>> Jenkins failures.  The only failures I'm seeing locally are
> > > > > >>>> unrelated SSL issues.
> > > > > >>>> On Thu, Nov 22, 2018 at 3:45 AM Enrico Olivelli <eolivelli@gmail.com> wrote:
> > > > > >>>>>
> > > > > >>>>> On Thu, Nov 22, 2018 at 12:44 Michael K. Edwards
> > > > > >>>>> <m.k.edwards@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>> I'm glad to be able to help.
> > > > > >>>>>>
> > > > > >>>>>> It appears as though some of the "flaky tests" result from
> > > > > >>>>>> another process stealing a server port between the time that it
> > > > > >>>>>> is assigned (in org.apache.zookeeper.PortAssignment.unique())
> > > > > >>>>>> and the time that it is bound.
> > > > > >>>>>
> > > > > >>>>> You can try running tests using a single thread; this will
> > > > > >>>>> "mitigate" the problem a bit.
> > > > > >>>>>
> > > > > >>>>> Enrico
> > > > > >>>>>
> > > > > >>>>>> This happened, for example, in
> > > > > >>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/;
> > > > > >>>>>> looking in the console text, I found:
> > > > > >>>>>>
> > > > > >>>>>>     [exec]     [junit] 2018-11-22 00:18:30,336 [myid:] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@884] - My election bind port: localhost/127.0.0.1:19459
> > > > > >>>>>>     [exec]     [junit] 2018-11-22 00:18:30,337 [myid:] - INFO [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@493] - binding to port localhost/127.0.0.1:19466
> > > > > >>>>>>     [exec]     [junit] 2018-11-22 00:18:30,337 [myid:] - ERROR [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@497] - Error while reconfiguring
> > > > > >>>>>>     [exec]     [junit] org.jboss.netty.channel.ChannelException: Failed to bind to: localhost/127.0.0.1:19466
> > > > > >>>>>>     [exec]     [junit] at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
> > > > > >>>>>>     [exec]     [junit] at org.apache.zookeeper.server.NettyServerCnxnFactory.reconfigure(NettyServerCnxnFactory.java:494)
> > > > > >>>>>>     [exec]     [junit] at org.apache.zookeeper.server.quorum.QuorumPeer.processReconfig(QuorumPeer.java:1947)
> > > > > >>>>>>     [exec]     [junit] at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:154)
> > > > > >>>>>>     [exec]     [junit] at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:93)
> > > > > >>>>>>     [exec]     [junit] at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1263)
> > > > > >>>>>>     [exec]     [junit] Caused by: java.net.BindException: Address already in use
> > > > > >>>>>>     [exec]     [junit] at sun.nio.ch.Net.bind0(Native Method)
> > > > > >>>>>>     [exec]     [junit] at sun.nio.ch.Net.bind(Net.java:433)
> > > > > >>>>>>     [exec]     [junit] at sun.nio.ch.Net.bind(Net.java:425)
> > > > > >>>>>>     [exec]     [junit] at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> > > > > >>>>>>     [exec]     [junit] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> > > > > >>>>>>     [exec]     [junit] at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
> > > > > >>>>>>     [exec]     [junit] at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
> > > > > >>>>>>     [exec]     [junit] at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
> > > > > >>>>>>     [exec]     [junit] at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
> > > > > >>>>>>     [exec]     [junit] at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> > > > > >>>>>>     [exec]     [junit] at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> > > > > >>>>>>     [exec]     [junit] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > > > > >>>>>>     [exec]     [junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > > > > >>>>>>     [exec]     [junit] at java.lang.Thread.run(Thread.java:748)
> > > > >>>>>>
> > > > > >>>>>> We currently log-and-swallow this exception (and many others)
> > > > > >>>>>> down in NettyServerCnxnFactory.reconfigure() and
> > > > > >>>>>> NIOServerCnxnFactory.reconfigure(), which is ... not ideal.
> > > > > >>>>>>
> > > > > >>>>>> How should we handle a bind failure in the real world?  Seems
> > > > > >>>>>> like we ought to throw a BindException out at least as far as
> > > > > >>>>>> the caller of QuorumPeer.processReconfig().  That's either
> > > > > >>>>>> Follower/Leader/Learner/Observer or FastLeaderElection.
> > > > > >>>>>> Presumably they should immediately go read-only when they can't
> > > > > >>>>>> bind the client port?
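
[Editor's sketch] The difference between log-and-swallow and propagating a bind failure, as raised in the message above, can be shown with a small stand-in. This is a sketch under stated assumptions: ReconfigureSketch and its methods are hypothetical and use a plain java.net.ServerSocket; they are not ZooKeeper's NettyServerCnxnFactory or NIOServerCnxnFactory, and what the caller should do on failure (for example, going read-only) is left open, as it is in the thread.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for a connection factory; not ZooKeeper code.
class ReconfigureSketch {
    private ServerSocket listener;

    // Log-and-swallow: the caller cannot tell that the rebind failed.
    void reconfigureSwallowing(InetSocketAddress addr) {
        try {
            rebind(addr);
        } catch (IOException e) {
            System.err.println("Error while reconfiguring: " + e);
        }
    }

    // Propagating: a BindException (an IOException) reaches the caller,
    // which can then decide how to react.
    void reconfigurePropagating(InetSocketAddress addr) throws IOException {
        rebind(addr);
    }

    // Binds a new listener first; the old one is replaced only on success.
    private void rebind(InetSocketAddress addr) throws IOException {
        ServerSocket next = new ServerSocket();
        try {
            next.bind(addr);
        } catch (IOException e) {
            next.close();
            throw e;
        }
        if (listener != null) {
            listener.close();
        }
        listener = next;
    }

    // Returns -1 while nothing has been bound successfully.
    int boundPort() {
        return listener == null ? -1 : listener.getLocalPort();
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket occupant = new ServerSocket()) {
            occupant.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress taken = new InetSocketAddress(
                    InetAddress.getLoopbackAddress(), occupant.getLocalPort());
            ReconfigureSketch factory = new ReconfigureSketch();
            factory.reconfigureSwallowing(taken); // failure is only logged
            try {
                factory.reconfigurePropagating(taken);
            } catch (BindException e) {
                System.out.println("caller sees the failure: " + e.getMessage());
            }
        }
    }
}
```

With the propagating variant, the decision about how to degrade sits with the caller rather than being hidden inside the factory.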
> > > > > >>>>>> On Thu, Nov 22, 2018 at 1:23 AM Enrico Olivelli <eolivelli@gmail.com> wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> Thank you very much Michael
> > > > > >>>>>>> I am following and reviewing your patches
> > > > > >>>>>>>
> > > > > >>>>>>> Enrico
> > > > > >>>>>>> On Thu, Nov 22, 2018 at 10:14 Michael K. Edwards
> > > > > >>>>>>> <m.k.edwards@gmail.com> wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> Hmm.  Jira's a bit of a boneyard, isn't it?  And timeouts in
> > > > > >>>>>>>> flaky tests are a problem.
> > > > > >>>>>>>>
> > > > > >>>>>>>> I scrubbed through the open bugs and picked the ones that
> > > > > >>>>>>>> looked to me like they might deserve attention for 3.5.5 or
> > > > > >>>>>>>> soon thereafter.  They're all on my watchlist:
> > > > > >>>>>>>> https://issues.apache.org/jira/issues/?filter=-1&jql=watcher%20%3D%20mkedwards%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20created%20ASC
> > > > > >>>>>>>> (I'm not counting the Ant->Maven transition in that, which I
> > > > > >>>>>>>> don't know much about.)
> > > > > >>>>>>>>
> > > > > >>>>>>>> I'm trying out some more verbose logging for the junit tests,
> > > > > >>>>>>>> to try to understand test flakiness.  But the Jenkins
> > > > > >>>>>>>> pre-commit pipeline appears to be down?
> > > > > >>>>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/
> > > > > >>>>>>>> On Wed, Nov 21, 2018 at 2:29 PM Michael K. Edwards
> > > > > >>>>>>>> <m.k.edwards@gmail.com> wrote:
> > > > >>>>>>>>>
> > > > > >>>>>>>>> Looks like we're really close.  Can I help?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I think this is the list of release blockers:
> > > > > >>>>>>>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZooKeeper%20and%20resolution%20%3D%20Unresolved%20and%20fixVersion%20%3D%203.5.5%20AND%20priority%20in%20(blocker%2C%20critical)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I currently see 7 issues in that search, of which 4 are
> > > > > >>>>>>>>> aspects of the ongoing switch from ant to maven.  Setting
> > > > > >>>>>>>>> that aside for the moment, there are 3 critical bugs:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> ZOOKEEPER-2778  Potential server deadlock between follower
> > > > > >>>>>>>>> sync with leader and follower receiving external connection
> > > > > >>>>>>>>> requests.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> ZOOKEEPER-1636  c-client crash when zoo_amulti failed
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> ZOOKEEPER-1818  Fix don't care for trunk
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> I put them in that order because that's the order in which
> > > > > >>>>>>>>> I've stacked the fixes in
> > > > > >>>>>>>>> https://github.com/mkedwards/zookeeper/tree/branch-3.5.
> > > > > >>>>>>>>> Then on top of that, I've updated the versions of the
> > > > > >>>>>>>>> external library dependencies I think it's important to
> > > > > >>>>>>>>> update: Jetty, Jackson, and BouncyCastle.  The result seems
> > > > > >>>>>>>>> to be a green build in Jenkins:
> > > > > >>>>>>>>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2705/
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Are these fixes in principle landable on the 3.5 branch, or
> > > > > >>>>>>>>> do they have to go to master first?  Does master need help
> > > > > >>>>>>>>> to build green before these can land there?  Are there other
> > > > > >>>>>>>>> bugs that are similarly critical to fix, and not tagged for
> > > > > >>>>>>>>> 3.5.5 in Jira?  Is there other testing that I can help with?
> > > > > >>>>>>>>> Are more hands needed on the Maven work?
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Thanks for all the work that goes into keeping Zookeeper
> > > > > >>>>>>>>> healthy and advancing; it's a critical infrastructure
> > > > > >>>>>>>>> component in several systems I help develop and operate, and
> > > > > >>>>>>>>> I like being able to rely on it.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Cheers,
> > > > > >>>>>>>>> - Michael
> > > >
>
