zookeeper-dev mailing list archives

From Andor Molnar <an...@cloudera.com.INVALID>
Subject Re: Glide path to getting 3.5.x out of beta
Date Thu, 22 Nov 2018 16:22:49 GMT
Hi Michael!

Thanks for the great help to get 3.5 out of the door. We're getting closer
with each commit.

You asked a lot of questions in your email, which I'll try to answer, but
I believe the best approach is to deal with one problem at a time.
Especially in email, it's not ideal to mix different topics, because it
makes things hard to follow.

I'll focus on the 3.5 release in this thread, in line with the subject.
There's another thread that I update every so often, but your list is
pretty much accurate too. I use the following query for 3.5 blockers:

project = ZooKeeper AND resolution = Unresolved AND fixVersion = 3.5.5 AND
priority in (blocker, critical) ORDER BY priority DESC, key ASC

ZOOKEEPER-1818 - Fangmin is working on it and a patch is available on
GitHub.
ZOOKEEPER-2778 - You're working on it and a patch is available. You should
assign the Jira to yourself so that nobody else picks it up.
ZOOKEEPER-1636 - An ancient C issue which has a patch available in Jira.
I'm planning to rebase it on master, but haven't had a chance yet.

All of the others are Maven/doc related, and Tamas and Norbert are working
on them.
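
For reference, the JQL query above can be turned into a shareable search
link like the one Michael posted. The Java sketch below is only an
illustration: the class name JqlUrl and the method searchUrl are made up
for this note, and it assumes the issues.apache.org search page accepts the
query in its "jql" parameter, as the links elsewhere in this thread
suggest.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of ZooKeeper or Jira.
final class JqlUrl {
    // Base URL taken from the search links quoted in this thread.
    private static final String BASE = "https://issues.apache.org/jira/issues/?jql=";

    /** Percent-encodes a JQL string and appends it to the search URL. */
    static String searchUrl(String jql) {
        try {
            // URLEncoder writes spaces as '+'; switch to %20 to match the
            // style of the links in the thread. A literal '+' in the input
            // is already encoded as %2B at this point, so this is safe.
            return BASE + URLEncoder.encode(jql, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl(
            "project = ZooKeeper AND resolution = Unresolved AND fixVersion = 3.5.5 "
            + "AND priority in (blocker, critical) ORDER BY priority DESC, key ASC"));
    }
}
```

Note that URLEncoder also encodes the parentheses, which the hand-written
link in the thread leaves as-is; both forms should decode to the same query.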

Flaky tests are related, but we don't treat them as blockers. Here's
the umbrella Jira that I've created to track the progress:
https://issues.apache.org/jira/browse/ZOOKEEPER-3170

Feel free to pick up any of the open ones or create new ones if you think
it's necessary. It's generally better to open an individual Jira for every
issue you're working on and discuss the details there. You can open an
email thread too, if you find that more convenient, but Jira is preferred.

The preferred workflow is: Open Jira -> GitHub PR -> Commit to master ->
Backport to 3.5/3.4 if necessary -> Close Jira.

Thank you for your contribution again!

Andor



On Thu, Nov 22, 2018 at 12:51 PM Michael K. Edwards <m.k.edwards@gmail.com>
wrote:

> I think it's mostly a problem in CI, where other processes on the same
> machine may compete for the port range, producing spurious Jenkins
> failures.  The only failures I'm seeing locally are unrelated SSL
> issues.
> On Thu, Nov 22, 2018 at 3:45 AM Enrico Olivelli <eolivelli@gmail.com>
> wrote:
> >
> > On Thu, Nov 22, 2018 at 12:44 Michael K. Edwards
> > <m.k.edwards@gmail.com> wrote:
> > >
> > > I'm glad to be able to help.
> > >
> > > It appears as though some of the "flaky tests" result from another
> > > process stealing a server port between the time that it is assigned
> > > (in org.apache.zookeeper.PortAssignment.unique()) and the time that it
> > > is bound.
> >
> > You can try running the tests using a single thread; this will
> > "mitigate" the problem a bit
> >
> > Enrico
> >
> > > This happened, for example, in
> > > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2708/;
> > > looking in the console text, I found:
> > >
> > >      [exec]     [junit] 2018-11-22 00:18:30,336 [myid:] - INFO [QuorumPeerListener:QuorumCnxManager$Listener@884] - My election bind port: localhost/127.0.0.1:19459
> > >      [exec]     [junit] 2018-11-22 00:18:30,337 [myid:] - INFO [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@493] - binding to port localhost/127.0.0.1:19466
> > >      [exec]     [junit] 2018-11-22 00:18:30,337 [myid:] - ERROR [QuorumPeer[myid=1](plain=/127.0.0.1:19457)(secure=disabled):NettyServerCnxnFactory@497] - Error while reconfiguring
> > >      [exec]     [junit] org.jboss.netty.channel.ChannelException: Failed to bind to: localhost/127.0.0.1:19466
> > >      [exec]     [junit] at org.jboss.netty.bootstrap.ServerBootstrap.bind(ServerBootstrap.java:272)
> > >      [exec]     [junit] at org.apache.zookeeper.server.NettyServerCnxnFactory.reconfigure(NettyServerCnxnFactory.java:494)
> > >      [exec]     [junit] at org.apache.zookeeper.server.quorum.QuorumPeer.processReconfig(QuorumPeer.java:1947)
> > >      [exec]     [junit] at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:154)
> > >      [exec]     [junit] at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:93)
> > >      [exec]     [junit] at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1263)
> > >      [exec]     [junit] Caused by: java.net.BindException: Address already in use
> > >      [exec]     [junit] at sun.nio.ch.Net.bind0(Native Method)
> > >      [exec]     [junit] at sun.nio.ch.Net.bind(Net.java:433)
> > >      [exec]     [junit] at sun.nio.ch.Net.bind(Net.java:425)
> > >      [exec]     [junit] at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> > >      [exec]     [junit] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
> > >      [exec]     [junit] at org.jboss.netty.channel.socket.nio.NioServerBoss$RegisterTask.run(NioServerBoss.java:193)
> > >      [exec]     [junit] at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
> > >      [exec]     [junit] at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
> > >      [exec]     [junit] at org.jboss.netty.channel.socket.nio.NioServerBoss.run(NioServerBoss.java:42)
> > >      [exec]     [junit] at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
> > >      [exec]     [junit] at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
> > >      [exec]     [junit] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> > >      [exec]     [junit] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> > >      [exec]     [junit] at java.lang.Thread.run(Thread.java:748)
> > >
> > > We currently log-and-swallow this exception (and many others) down in
> > > NettyServerCnxnFactory.reconfigure() and
> > > NIOServerCnxnFactory.reconfigure(), which is ... not ideal.
> > >
> > > How should we handle a bind failure in the real world?  Seems like we
> > > ought to throw a BindException out at least as far as the caller of
> > > QuorumPeer.processReconfig().  That's either
> > > Follower/Leader/Learner/Observer or FastLeaderElection.  Presumably
> > > they should immediately go read-only when they can't bind the client
> > > port?
> > > On Thu, Nov 22, 2018 at 1:23 AM Enrico Olivelli <eolivelli@gmail.com> wrote:
> > > >
> > > > Thank you very much Michael
> > > > I am following and reviewing your patches
> > > >
> > > > Enrico
> > > > On Thu, Nov 22, 2018 at 10:14 Michael K. Edwards
> > > > <m.k.edwards@gmail.com> wrote:
> > > > >
> > > > > Hmm.  Jira's a bit of a boneyard, isn't it?  And timeouts in flaky
> > > > > tests are a problem.
> > > > >
> > > > > I scrubbed through the open bugs and picked the ones that looked
> > > > > to me like they might deserve attention for 3.5.5 or soon
> > > > > thereafter.  They're all on my watchlist:
> > > > > https://issues.apache.org/jira/issues/?filter=-1&jql=watcher%20%3D%20mkedwards%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20created%20ASC
> > > > > (I'm not counting the Ant->Maven transition in that, which I don't
> > > > > know much about.)
> > > > >
> > > > > I'm trying out some more verbose logging for the junit tests, to
> > > > > try to understand test flakiness.  But the Jenkins pre-commit
> > > > > pipeline appears to be down?
> > > > > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/
> > > > > On Wed, Nov 21, 2018 at 2:29 PM Michael K. Edwards
> > > > > <m.k.edwards@gmail.com> wrote:
> > > > > >
> > > > > > Looks like we're really close.  Can I help?
> > > > > >
> > > > > > I think this is the list of release blockers:
> > > > > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20ZooKeeper%20and%20resolution%20%3D%20Unresolved%20and%20fixVersion%20%3D%203.5.5%20AND%20priority%20in%20(blocker%2C%20critical)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
> > > > > >
> > > > > > I currently see 7 issues in that search, of which 4 are aspects
> > > > > > of the ongoing switch from ant to maven.  Setting that aside for
> > > > > > the moment, there are 3 critical bugs:
> > > > > >
> > > > > > ZOOKEEPER-2778  Potential server deadlock between follower sync
> > > > > > with leader and follower receiving external connection requests.
> > > > > >
> > > > > > ZOOKEEPER-1636  c-client crash when zoo_amulti failed
> > > > > >
> > > > > > ZOOKEEPER-1818  Fix don't care for trunk
> > > > > >
> > > > > > I put them in that order because that's the order in which I've
> > > > > > stacked the fixes in
> > > > > > https://github.com/mkedwards/zookeeper/tree/branch-3.5.  Then
> > > > > > on top of that, I've updated the versions of the external
> > > > > > library dependencies I think it's important to update: Jetty,
> > > > > > Jackson, and BouncyCastle.  The result seems to be a green build
> > > > > > in Jenkins:
> > > > > > https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2705/
> > > > > >
> > > > > > Are these fixes in principle landable on the 3.5 branch, or do
> > > > > > they have to go to master first?  Does master need help to build
> > > > > > green before these can land there?  Are there other bugs that
> > > > > > are similarly critical to fix, and not tagged for 3.5.5 in Jira?
> > > > > > Is there other testing that I can help with?  Are more hands
> > > > > > needed on the Maven work?
> > > > > >
> > > > > > Thanks for all the work that goes into keeping Zookeeper healthy
> > > > > > and advancing; it's a critical infrastructure component in
> > > > > > several systems I help develop and operate, and I like being
> > > > > > able to rely on it.
> > > > > >
> > > > > > Cheers,
> > > > > > - Michael
>
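
A note on the port race described in the quoted thread, where a port is
picked first and bound later, so another process can take it in between.
The Java sketch below is a minimal illustration and not ZooKeeper's
PortAssignment code; PortRaceSketch and its method names are hypothetical.
It shows two things: binding to port 0 lets the OS choose and reserve a
free port in one step, and a second bind to a port that is already held
fails with java.net.BindException, which is the error in the log above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical illustration; not ZooKeeper's PortAssignment implementation.
final class PortRaceSketch {

    /** Binds to port 0 so the OS picks a free port and reserves it in one step. */
    static ServerSocket bindEphemeral() throws IOException {
        ServerSocket socket = new ServerSocket();
        socket.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
        return socket;
    }

    /** Returns true if binding the given loopback port fails because it is in use. */
    static boolean isTaken(int port) throws IOException {
        try (ServerSocket probe = new ServerSocket()) {
            probe.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), port));
            return false;
        } catch (BindException e) {
            return true;
        }
    }

    /** Holds a port, then confirms that a second bind to it is rejected. */
    static boolean demo() {
        try (ServerSocket held = bindEphemeral()) {
            return held.getLocalPort() > 0 && isTaken(held.getLocalPort());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind rejected: " + demo());
    }
}
```

Whether this fits the test suite depends on whether the servers under test
can report the port they ended up on; that is an open question here, not
something the thread settles.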

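On the log-and-swallow question raised in the quoted thread: the sketch
below contrasts a reconfigure step that logs a bind failure and carries on
with one that lets the BindException reach its caller. It is a simplified
stand-in, assuming a plain ServerSocket listener; ReconfigureSketch and its
methods are hypothetical and do not reflect the actual
NettyServerCnxnFactory or NIOServerCnxnFactory code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.BindException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical stand-in for a connection factory; not ZooKeeper code.
final class ReconfigureSketch {
    private ServerSocket listener;

    /** Binds the new address first, so a failure leaves the old listener intact. */
    private void rebind(InetSocketAddress addr) throws IOException {
        ServerSocket fresh = new ServerSocket();
        try {
            fresh.bind(addr);
        } catch (IOException e) {
            fresh.close();
            throw e;
        }
        if (listener != null) {
            listener.close();
        }
        listener = fresh;
    }

    /** Log-and-swallow: the caller cannot tell that the rebind failed. */
    void reconfigureSwallowing(InetSocketAddress addr) {
        try {
            rebind(addr);
        } catch (IOException e) {
            System.err.println("Error while reconfiguring: " + e);
        }
    }

    /** Propagating: a BindException reaches the caller, which can react to it. */
    void reconfigurePropagating(InetSocketAddress addr) throws IOException {
        rebind(addr);
    }

    /** Occupies a port, then tries both variants against it. */
    static boolean demo() {
        try (ServerSocket occupant = new ServerSocket()) {
            InetAddress loopback = InetAddress.getLoopbackAddress();
            occupant.bind(new InetSocketAddress(loopback, 0));
            InetSocketAddress taken =
                new InetSocketAddress(loopback, occupant.getLocalPort());

            ReconfigureSketch factory = new ReconfigureSketch();
            factory.reconfigureSwallowing(taken); // returns normally despite the failure
            try {
                factory.reconfigurePropagating(taken);
                return false; // unexpected: the bind succeeded
            } catch (BindException expected) {
                return true;  // the caller sees the failure
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

What the caller should then do (for example, go read-only, as Michael
suggests) is a design decision this sketch does not address.
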