zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Han (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2766) Quorum fails with java.io.EOFException
Date Tue, 25 Apr 2017 23:41:04 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983838#comment-15983838
] 

Michael Han commented on ZOOKEEPER-2766:
----------------------------------------

What is the latency of the network? Try compare the latency with initLimit / syncLimit see
if they are at similar scale. If not (for example if network is slow try increase initLimit/syncLimit
see if that alleviate the problem).
In certain cases due to ZOOKEEPER-900 / 901 the leader election might takes a long time to
finish.

> Quorum fails with java.io.EOFException
> --------------------------------------
>
>                 Key: ZOOKEEPER-2766
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2766
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum
>    Affects Versions: 3.5.3
>         Environment: CentOS-7, Docker version 17.03.1-ce
>            Reporter: Patrick Kleindienst
>         Attachments: logs-node1.txt, logs-node2.txt, logs-node3.txt
>
>
> When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing the following
behavior:
> Two nodes (let's say node 2 and 3) out of the three start their own quorum, and finally
one of them is elected the new leader (node 3), while the other one becomes the follower (node
2). Node 1 seems to be able to establish a connection to node 3 (elected leader) in my case,
but this seems to fail for node 2. 
> Node 1 shows the following in its logs:
>  2017-04-25 09:24:02,806 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055]
- LOOKING
> 2017-04-25 09:24:02,808 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894]
- New election. My id =  1, proposed zxid=0x0
> 2017-04-25 09:24:02,811 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@688]
- Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
(n.state), 1 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:24:02,817 [myid:1] - WARN  [WorkerSender[myid=1]:QuorumCnxManager@457]
- Cannot open channel to 2 at election address /9.152.171.98:3888
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
>         at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
>         at java.lang.Thread.run(Thread.java:745)
> 2017-04-25 09:24:02,822 [myid:1] - INFO  [WorkerSender[myid=1]:QuorumCnxManager@278]
- Have smaller server identifier, so dropping the connection: (3, 1)
> 2017-04-25 09:24:03,025 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457]
- Cannot open channel to 2 at election address /9.152.171.98:3888
> However, that's not all, since the quorum consisting of node 2 and 3 does not work properly.
The nodes' logs tell that leader election between these two works fine.
> Here's what node 3 (leader) says:
>  2017-04-25 09:09:33,842 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688]
- Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
(n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:33,844 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688]
- Notification: 2 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
(n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:33,851 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@688]
- Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
(n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:34,051 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
- Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
> 2017-04-25 09:09:34,052 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143]
- LEADING
> 2017-04-25 09:09:34,055 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63]
- TCP NoDelay set to: true
> 2017-04-25 09:09:34,055 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83]
- zookeeper.leader.maxConcurrentSnapshots = 10
> 2017-04-25 09:09:34,056 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85]
- zookeeper.leader.maxConcurrentSnapshotTimeout = 5
> And here's the output node 2 (follower) provides:
> 2017-04-25 09:09:31,875 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@688]
- Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING
(n.state), 2 (n.sid), 0x0 (
> n.peerEPoch), LOOKING (my state)0 (n.config version)
> 2017-04-25 09:09:32,077 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128]
- Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
> ica.2,name2=LeaderElection]
> 2017-04-25 09:09:32,077 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131]
- FOLLOWING
> 2017-04-25 09:09:32,082 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88]
- TCP NoDelay set to: true
> So far, so good. But seconds later the connection between node 2 and 3 seems to get lost,
causing leader node 3 to report an EOFExeption. If I understand the logs correctly, node 2
(follower) properly closes the connection (sending "Goodbye"), whilst node 3 says that the
socket is still open. 
> 2017-04-25 09:09:34,190 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414]
- LEADING - LEADER ELECTION TOOK - 138 MS
> 2017-04-25 09:09:34,197 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320]
- Snapshotting: 0x0 to /data/version-2/snapshot.0
> 2017-04-25 09:09:35,076 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@382]
- Follower sid: 2 : info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
> 2017-04-25 09:09:35,113 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@683]
- Synchronizing with Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0
peerLastZxid=0x
> 0
> 2017-04-25 09:09:35,114 [myid:3] - INFO  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@727]
- Sending DIFF zxid=0x0 for peer sid: 2
> 2017-04-25 09:09:35,133 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258]
- Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and setting last processed
zxid: 0x100000000
> 2017-04-25 09:09:35,169 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255]
- Configuring CommitProcessor with 2 worker threads.
> 2017-04-25 09:09:35,179 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64]
- Using checkIntervalMs=60000 maxPerMinute=10000
> 2017-04-25 09:09:35,196 [myid:3] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@919]
- Connection broken for id 2, my id = 3, error = 
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> 2017-04-25 09:09:35,196 [myid:3] - WARN  [RecvWorker:2:QuorumCnxManager$RecvWorker@922]
- Interrupting SendWorker
> 2017-04-25 09:09:35,197 [myid:3] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@836]
- Interrupted while waiting for message on queue
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:35,197 [myid:3] - WARN  [SendWorker:2:QuorumCnxManager$SendWorker@845]
- Send worker leaving thread  id 2 my id = 3
> 2017-04-25 09:09:35,204 [myid:3] - ERROR [LearnerHandler-/9.152.171.98:51328:LearnerHandler@604]
- Unexpected exception causing shutdown while sock still open
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
> 2017-04-25 09:09:35,204 [myid:3] - WARN  [LearnerHandler-/9.152.171.98:51328:LearnerHandler@619]
- ******* GOODBYE /9.152.171.98:51328 ********
> 2017-04-25 09:09:37,181 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626]
- Shutting down
> 2017-04-25 09:09:37,182 [myid:3] - INFO  [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632]
- Shutdown called
> java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced
with sids: [ [3] ]
>         at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
>         at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)
> Unfortunately, node 2 does not provide any additional information on what exactly is
going on. After leader election, the only thing it reports is this:
> 2017-04-25 09:09:32,091 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68]
- FOLLOWING - LEADER ELECTION TOOK - 13 MS
> 2017-04-25 09:09:32,094 [myid:2] - WARN  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273]
- Unexpected exception, tries=0, remaining init limit=9999, connecting to /9.152.171.12:288
> 8
> java.net.ConnectException: Connection refused (Connection refused)
>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
>         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
>         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>         at java.net.Socket.connect(Socket.java:589)
>         at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
>         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
> 2017-04-25 09:09:33,142 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369]
- Getting a diff from the leader 0x0
> 2017-04-25 09:09:33,146 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516]
- Learner received NEWLEADER message
> 2017-04-25 09:09:33,207 [myid:2] - INFO  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499]
- Learner received UPTODATE message
> 2017-04-25 09:09:33,220 [myid:2] - WARN  [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446]
- Restarting Leader Election
> 2017-04-25 09:09:33,221 [myid:2] - INFO  [/0.0.0.0:3888:QuorumCnxManager$Listener@665]
- Leaving listener
> 2017-04-25 09:09:33,222 [myid:2] - WARN  [SendWorker:3:QuorumCnxManager$SendWorker@836]
- Interrupted while waiting for message on queue
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>         at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
> 2017-04-25 09:09:33,222 [myid:2] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@919]
- Connection broken for id 3, my id = 2, error = 
> java.net.SocketException: Socket closed
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>         at java.net.SocketInputStream.read(SocketInputStream.java:171)
>         at java.net.SocketInputStream.read(SocketInputStream.java:141)
>         at java.net.SocketInputStream.read(SocketInputStream.java:224)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
> As far as I get that, node 2 wants to start a new leader election, but fails to establish
a connection to the other nodes. It tries over and over, finally ending up in a timeout. Unfortunately,
this doesn't give me any hint on what exactly breaks up the connection between the follower
node (node 2) and the leader node (node 3) and why it can be re-established. 
> It might also be relevant that I'm running ZooKeeper in Docker containers, using the
host network option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message