zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Han <h...@cloudera.com>
Subject Re: Observers taking a long time to recover after network outage
Date Tue, 07 Feb 2017 21:51:01 GMT
>> My expectation was that it
would reconnect once the network healed.

Right, it is intended to behave like that, but I see there are a couple of
cases that it could took longer to recover:

* Network condition is not stable after outage - for example the latency is
longer than what's configured for syncLimits initially but later drop to
the range within syncLimits.

* Currently ZooKeeper manages the connections between servers in a single
threaded blocking fashion, so if you have a large number of observers, it
would take a while for all of them gets reconnected.

BTW the leader log (if it's completed) indicates that nothing happens on
leader for the observer 10.110.11.16 <http://10.110.11.16:34573/> between
outage, which implies this is a networking issue between DCs.

>> Should I have some monitor process on the observer that checks if
"Zookeeper is not serving requests" and restart it?

I don't think that will make any big difference - "Zookeeper is not serving
requests" means the server process is started, but the server is not in a
state to serve request. For a server to serve request, the server has to
first connect to a leader (or make itself a leader somehow), synchronize
with the leader, and then could it start serving requests. Restart the
server would not impact anything here, as the restarted server has to go
through the same process.

On Fri, Feb 3, 2017 at 6:48 AM, Swaminathan Gnanaskandan <gswami@gmail.com>
wrote:

> Hi,
> We have a 5 node ZK-3.4.9-cluster running in DC1 and a couple of observers
> across WAN in DC2. There was a network outage between DC1 and DC2 for about
> 1 minute after which the observers took almost 15 minutes to connect to the
> leader in DC1.
> How can I ensure that observer recovers faster? My expectation was that it
> would reconnect once the network healed.
>
> zookeeper-leader.log
> ----------------------
> zookeeper.log.3:2017-02-02 05:37:22,045 - ERROR
> [LearnerHandler-/10.110.11.16:55076:LearnerHandler@631] - Unexpected
> exception causing shutdown while sock still open
> zookeeper.log.3:2017-02-02 05:37:22,045 - WARN
> [LearnerHandler-/10.110.11.16:55076:LearnerHandler@644] - ******* GOODBYE
> /
> 10.110.11.16:55076 ********
> zookeeper.log.3:2017-02-02 05:54:05,069 - INFO  [zk-02/10.100.104.82:3888
> :QuorumCnxManager$Listener@541] - Received connection request /
> 10.110.11.16:34573
> zookeeper.log.3:2017-02-02 05:54:05,249 - INFO
> [LearnerHandler-/10.110.11.16:40257:LearnerHandler@329] - Follower sid: 17
> : info : org.apache.zookeeper.server.quorum.QuorumPeer$
> QuorumServer@35bc2f85
> zookeeper.log.3:2017-02-02 05:54:05,320 - INFO
> [LearnerHandler-/10.110.11.16:40257:LearnerHandler@384] - Synchronizing
> with Follower sid: 17 maxCommittedLog=0x104c66a22
> minCommittedLog=0x104c6682e peerLastZxid=0x100919196
> zookeeper.log.3:2017-02-02 05:54:05,320 - WARN
> [LearnerHandler-/10.110.11.16:40257:LearnerHandler@451] - Unhandled
> proposal scenario
> zookeeper.log.3:2017-02-02 05:54:05,321 - INFO
> [LearnerHandler-/10.110.11.16:40257:LearnerHandler@458] - Sending SNAP
> zookeeper.log.3:2017-02-02 05:54:05,321 - INFO
> [LearnerHandler-/10.110.11.16:40257:LearnerHandler@482] - Sending snapshot
> last zxid of peer is 0x100919196  zxid of leader is 0x104c66a22sent zxid of
> db as 0x104c66a22
> zookeeper.log.3:2017-02-02 05:54:06,208 - INFO
> [LearnerHandler-/10.110.11.16:40257:LearnerHandler@518] - Received
> NEWLEADER-ACK message from 17
>
>
> zookeeper-observer.log
> ----------------------
> zookeeper.log.2:2017-02-02 05:37:22,637 - WARN
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Observer@77] - Exception when
> observing the leader
> zookeeper.log.2: at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:833)
> zookeeper.log.2:2017-02-02 05:37:22,638 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Observer@135] - shutdown called
> zookeeper.log.2: at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:837)
> zookeeper.log.2:2017-02-02 05:37:22,672 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@497] - shutting
> down
> zookeeper.log.2:2017-02-02 05:37:22,672 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ObserverRequestProcessor@121] -
> Shutting down
> zookeeper.log.2:2017-02-02 05:37:22,672 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting
> down
> zookeeper.log.2:2017-02-02 05:37:22,672 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
> shutdown of request processor complete
> zookeeper.log.2:2017-02-02 05:37:22,672 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> zookeeper.log.2:2017-02-02 05:37:22,673 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  17, proposed zxid=0x8000000000000000
> zookeeper.log.2:2017-02-02 05:37:22,874 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 400
> zookeeper.log.2:2017-02-02 05:37:23,274 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 800
> zookeeper.log.2:2017-02-02 05:37:24,075 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 1600
> zookeeper.log.2:2017-02-02 05:37:25,675 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 3200
> zookeeper.log.2:2017-02-02 05:37:28,876 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 6400
> zookeeper.log.2:2017-02-02 05:37:35,276 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 12800
> zookeeper.log.2:2017-02-02 05:37:48,076 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 25600
> zookeeper.log.2:2017-02-02 05:38:13,677 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 51200
> zookeeper.log.2:2017-02-02 05:39:04,877 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.2:2017-02-02 05:40:04,878 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.2:2017-02-02 05:41:04,878 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.2:2017-02-02 05:42:04,879 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.2:2017-02-02 05:43:04,879 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.2:2017-02-02 05:44:04,879 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:45:04,880 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:46:04,880 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:47:04,881 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:48:04,881 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:49:04,881 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:50:04,882 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:51:04,882 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:52:04,882 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:53:04,883 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:54:04,883 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> zookeeper.log.1:2017-02-02 05:54:05,103 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:QuorumPeer@831] - OBSERVING
> zookeeper.log.1:2017-02-02 05:54:05,103 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created
> server with tickTime 3000 minSessionTimeout 6000 maxSessionTimeout 60000
> datadir /var/lib/zookeeper/data/version-2 snapdir
> /var/lib/zookeeper/data/version-2
> zookeeper.log.1:2017-02-02 05:54:05,103 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:ObserverZooKeeperServer@64] -
> syncEnabled =false
> zookeeper.log.1:2017-02-02 05:54:05,108 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
> Resolved hostname: zk-02 to address: zk-02/10.100.104.82
> zookeeper.log.1:2017-02-02 05:54:05,108 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Observer@65] - Observing zk-02/
> 10.100.104.82:2888
> zookeeper.log.1:2017-02-02 05:54:05,182 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
> /var/lib/zookeeper/data/version-2/snapshot.100919196
> zookeeper.log.1:2017-02-02 05:54:05,353 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Learner@329] - Getting a
> snapshot
> from leader
> zookeeper.log.1:2017-02-02 05:54:06,109 - INFO
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@240] -
> Snapshotting: 0x104c66a22 to
> /var/lib/zookeeper/data/version-2/snapshot.104c66a22
> zookeeper.log.1:2017-02-02 05:54:06,169 - WARN
> [QuorumPeer[myid=17]/0:0:0:0:0:0:0:0:2181:Learner@408] - Got zxid
> 0x104c66a23 expected 0x1
>
> Should I have some monitor process on the observer that checks if
> "Zookeeper is not serving requests" and restart it?
>
> Regards,
> Swami
>



-- 
Cheers
Michael.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message