zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Nauroth (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ZOOKEEPER-1005) Zookeeper servers fail to elect a leader succesfully.
Date Tue, 21 Jun 2016 22:21:58 GMT

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Chris Nauroth updated ZOOKEEPER-1005:
-------------------------------------
    Fix Version/s:     (was: 3.5.2)
                   3.5.3

> Zookeeper servers fail to elect a leader succesfully.
> -----------------------------------------------------
>
>                 Key: ZOOKEEPER-1005
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1005
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.2.2
>         Environment: zookeeper-3.2.2; debian
>            Reporter: Alexandre Hardy
>             Fix For: 3.6.0, 3.5.3
>
>
> We were running 3 zookeeper servers, and simulated a failure on one of the servers. 
> The one zookeeper node follows the other, but has trouble connecting. It looks like the
following exception is the cause:
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumPeer]
FOLLOWING
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  [org.apache.zookeeper.server.ZooKeeperServer]
Created server
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower]
Following zookeeper3/192.168.131.11:2888
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower]
Unexpected exception, tries=0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING java.net.ConnectException: --  Connection
refused
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.PlainSocketImpl.socketConnect(Native
Method)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-60 WARNING  --     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> {noformat}
> The last exception while connecting was:
> {noformat}
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower]
Unexpected exception
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR java.net.ConnectException: --  Connection
refused
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.PlainSocketImpl.socketConnect(Native
Method)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:310)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:176)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:163)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at java.net.Socket.connect(Socket.java:546)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:156)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 ERR  --     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:549)
> 2011-03-01T14:02:33+02:00 e0-cb-4e-65-4d-60 WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.Follower]
Exception when following the leader
> {noformat}
> The leader started leading a bit later 
> {noformat}
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection]
Notification: 0, 94489312534, 25, 2, LOOKING, LOOKING, 0
> 2011-03-01T14:02:29+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection]
Adding vote
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumCnxManager]
Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:32+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumPeer]
LEADING
> {noformat}
> But at that time the follower had already terminated and started a new election, so the
leader failed:
> {noformat}
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.ZooKeeperServer]
Created server
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.persistence.FileSnap]
Reading snapshot /var/lib/zookeeper/version-2/snapshot.1600007d16
> 2011-03-01T14:02:50+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.persistence.FileTxnSnapLog]
Snapshotting: 1600007d16
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumCnxManager]
Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:02:53+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection]
Sending new notification.
> 2011-03-01T14:03:11+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection]
Sending new notification.
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING [zookeeper] --  [org.apache.zookeeper.server.quorum.QuorumCnxManager]
Cannot open channel to 1 at election address zookeeper2/192.168.132.10:3888
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:302)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:323)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d WARNING  --     at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:296)
> 2011-03-01T14:03:14+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection]
Sending new notification.
> 2011-03-01T14:03:32+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.FastLeaderElection]
Sending new notification.
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO [zookeeper] --  [org.apache.zookeeper.server.quorum.Leader]
Shutdown called
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:371)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:297)
> 2011-03-01T14:03:34+02:00 e0-cb-4e-65-4d-7d INFO  --     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:562)
> {noformat}
> From http://zookeeper.apache.org/doc/r3.2.2/zookeeperStarted.html:
> {quote}
> The new entry, initLimit is timeouts ZooKeeper uses to limit the length of time the ZooKeeper
servers in quorum have to connect to a leader
> {quote}
> Since we have initLimit=10 and tickTime=4000, we should have 40 seconds for a zookeeper
server to contact the leader.
> However, in the source code src/java/main/org/apache/zookeeper/server/quorum/Follower.java:
> {noformat}
> 152                 for (int tries = 0; tries < 5; tries++) {
> 153                     try {
> 154                         //sock = new Socket();
> 155                         //sock.setSoTimeout(self.tickTime * self.initLimit);
> 156                         sock.connect(addr, self.tickTime * self.syncLimit);
> 157                         sock.setTcpNoDelay(nodelay);
> 158                         break;
> 159                     } catch (IOException e) {
> 160                         if (tries == 4) {
> 161                             LOG.error("Unexpected exception",e);
> 162                             throw e;
> 163                         } else {
> 164                             LOG.warn("Unexpected exception, tries="+tries,e);
> 165                             sock = new Socket();
> 166                             sock.setSoTimeout(self.tickTime * self.initLimit);
> 167                         }
> 168                     }
> 169                     Thread.sleep(1000);
> 170                 }
> {noformat}
> It appears as if we only have 4 seconds to contact the leader. The timeouts are applied
to the socket, but do not take into account that the zookeeper leader may not have started
its zookeeper service yet. 
> Is this the expected behaviour? Or is the expected behaviour that followers should always
be able to connect to the leader?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message