zookeeper-user mailing list archives

From "Flavio Junqueira" <fpjunque...@yahoo.com>
Subject RE: ZOOKEEPER-900 / 901 / 1678
Date Wed, 30 Apr 2014 22:13:34 GMT
Sure, having the logs might help.

-Flavio

-----Original Message-----
From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com] 
Sent: Wednesday, April 30, 2014 11:10 PM
To: user@zookeeper.apache.org
Subject: Re: ZOOKEEPER-900 / 901 / 1678

Thanks Flavio,
The length of the leader election seems directly related to the presence of this dead host
in the configuration though. If I remove the dead host from the configuration, a quorum is
quickly formed. From the logs it does appear that the election is completing though (after
about 15 seconds in most cases), but then another election seems to happen shortly afterwards.

Would it be helpful for me to provide debug level logs?
cheers
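
For reference, the quorum arithmetic behind this discussion can be sketched as follows. This is a minimal illustration, assuming the default majority quorum (strictly more than half of the configured servers); the class and method names are hypothetical and are not ZooKeeper's own API. With three configured servers and one dead host, the two live servers are still a majority, which is why a quorum is expected to form.

```java
public class QuorumSketch {
    // Majority quorum: strictly more than half of the configured voters.
    // (Hypothetical helper for illustration only.)
    static boolean hasQuorum(int supporters, int ensembleSize) {
        return supporters > ensembleSize / 2;
    }

    public static void main(String[] args) {
        int ensembleSize = 3; // two live servers plus one dead host, as in this thread
        System.out.println(hasQuorum(2, ensembleSize)); // prints "true"
        System.out.println(hasQuorum(1, ensembleSize)); // prints "false"
    }
}
```

Note that the dead host still counts toward the ensemble size for as long as it stays in the configuration.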


On Thu, May 1, 2014 at 8:04 AM, Flavio Junqueira <fpjunqueira@yahoo.com> wrote:

> Leader election seems to be taking a long time. The connection 
> attempts from QuorumCnxManager are not causing a new round of leader 
> election. What causes it is the absence of a quorum of supporters, so 
> the elected leader is not getting enough servers to support it.
>
> -Flavio
>
> -----Original Message-----
> From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com]
> Sent: Wednesday, April 30, 2014 10:36 PM
> To: user@zookeeper.apache.org
> Subject: Re: ZOOKEEPER-900 / 901 / 1678
>
> I've done a bit more testing this morning, and it appears that the 
> leader election is actually completing, but then just after the 
> election has completed, the connection attempt to the dead host times 
> out, and this seems to cause another leader election. The same thing 
> happens in the next leader election, and so on.
>
> 2014-04-30 04:07:25,383 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2183:Leader@358] - LEADING - LEADER ELECTION TOOK - 14662
> 2014-04-30 04:07:25,756 [myid:3] - WARN [WorkerSender[myid=3]:QuorumCnxManager@382] - Cannot open channel to 2 at election address /10.0.0.0:3889
> java.net.SocketTimeoutException: connect timed out
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:579)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:368)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:341)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:449)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:430)
>         at java.lang.Thread.run(Thread.java:744)
> 2014-04-30 04:07:25,757 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@597] - Notification: 1 (message format version), 3 (n.leader), 0xc00000001 (n.zxid), 0xb (n.round), LOOKING (n.state), 3 (n.sid), 0xd (n.peerEpoch) LEADING (my state)
>
> cheers
>
>
>
> On Wed, Apr 30, 2014 at 6:48 PM, Cameron McKenzie <mckenzie.cam@gmail.com> wrote:
>
> > hey Flavio,
> > Thanks for the quick reply.
> >
> > I'm running ZK 3.4.6. Having looked into the code a bit more, I 
> > think that I was slightly presumptuous about the root cause. The 
> > actual socket connects seem to be passing a timeout correctly, and 
> > based on the logs, I can see the timeouts on connect occurring.
> >
> > I can reproduce the issue on a VM running two instances of ZK. These 
> > instances are configured in a 3 node cluster (with the 2 real ZK 
> > instances, and one bogus IP address that will not resolve to anything 
> > useful).
> > Specifically, this bogus host is configured 2nd in the server list.
> > When I configured it third, the cluster would occasionally form a 
> > quorum (though still not consistently). I've attached the config and 
> > logs from both of the ZK instances.
> >
> > Any help would be much appreciated!
> > cheers
> >
> >
> >
> >
> > On Wed, Apr 30, 2014 at 6:09 PM, FPJ <fpjunqueira@yahoo.com> wrote:
> >
> >> Hi Cameron,
> >>
> >> Which version of ZK are you using? Also, if you can share logs, 
> >> then it might be easier for us to help you out.
> >>
> >> -Flavio
> >>
> >> > -----Original Message-----
> >> > From: Cameron McKenzie [mailto:mckenzie.cam@gmail.com]
> >> > Sent: 30 April 2014 08:44
> >> > To: zookeeper-user@hadoop.apache.org
> >> > Subject: ZOOKEEPER-900 / 901 / 1678
> >> >
> >> > ZooKeeper users,
> >> > Does anyone know the status of these issues? They don't seem to 
> >> > have had anything done to them since late 2010?
> >> >
> >> > I think that we're experiencing the same issue currently. If we have a
> >> > 3 node cluster, for example, and 1 of these nodes is completely dead
> >> > (i.e. the entire host is not contactable due to a power outage), I would
> >> > expect that a quorum could still be formed, but this does not appear to
> >> > be the case.
> >> >
> >> > I haven't delved into the code too much, but it appears that blocking IO
> >> > is being used for the connect. This doesn't respect the socket SO timeout
> >> > being set, so it means that the connect() call can block for some
> >> > arbitrary amount of time (based on the OS level TCP settings?). This in
> >> > turn means that leader election will fail because it times out before the
> >> > socket connect does, even though there are enough live hosts present to
> >> > form a quorum.
> >> >
> >> > This seems like a fairly fundamental problem, unless I'm missing
> >> > something. If a single host goes down due to a power failure, for example,
> >> > it can prevent any further hosts joining the cluster. In addition, if
> >> > after a power failure enough hosts come back online to form a quorum, but
> >> > some don't, a quorum may still not be able to be formed.
> >> > cheers
> >> > Cam
> >>
> >>
> >
>
>
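
On the connect-timeout question raised in the first message of this thread: in Java, the SO timeout set with Socket.setSoTimeout() applies to blocking reads, while a blocking connect is bounded by the timeout argument of Socket.connect(SocketAddress, int). The sketch below shows that distinction using only the standard library. The class name, method name and timeout values are hypothetical, chosen for illustration, and are not taken from the ZooKeeper source.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimeoutSketch {
    /**
     * Returns true if a TCP connection to host:port is established within
     * timeoutMs milliseconds. (Hypothetical helper for illustration only.)
     */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            // The explicit timeout bounds the blocking connect. A value of 0
            // would mean "wait indefinitely", leaving the limit to the OS.
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException ("connect timed out"),
            // connection refused, and unresolvable or unreachable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        // The address and port mirror the log excerpt in this thread;
        // substitute any unreachable host to observe the timeout.
        System.out.println(canConnect("10.0.0.0", 3889, 2000));
    }
}
```

A connect that is bounded this way fails with java.net.SocketTimeoutException: connect timed out, which matches the exception in the log excerpt above and the later observation that the connects do pass a timeout.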

