zookeeper-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jean-Marc Spaggiari (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-2669) follower failed to reconnect to leader after a network error
Date Mon, 29 Jul 2019 18:45:00 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895521#comment-16895521
] 

Jean-Marc Spaggiari commented on ZOOKEEPER-2669:
------------------------------------------------

I'm curious to see if anyone else faced that. I god few "Got zxid 0x320001019f expected 0x1"
in my logs too. Wondering if it can be related to network issues.

> follower failed to  reconnect to leader after a network error
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2669
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2669
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.4.9
>         Environment: CentOS7
>            Reporter: Zhenghua Chen
>            Priority: Major
>
> We have a zookeeper cluster with 3 nodes named s1, s2, s3
> By mistake, we shut down the ethernet interface of s2, and zk follower  shut down(zk
process remains there)
> Later, after ethernet up again, s2 failed to reconnect to leader s3 to be a follower
> follower s2 keeps printing log like this:
> {quote}
> 2017-01-19 16:40:58,956 WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:7181] o.a.z.s.q.Learner
- Got zxid 0x320001019f expected 0x1
> 2017-01-19 16:40:58,956 ERROR [SyncThread:1] o.a.z.s.ZooKeeperCriticalThread - Severe
unrecoverable error, from thread : SyncThread:1
> java.nio.channels.ClosedChannelException: null
> 	at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:99)
> 	at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:250)
> 	at org.apache.zookeeper.server.persistence.Util.padLogFile(Util.java:215)
> 	at org.apache.zookeeper.server.persistence.FileTxnLog.padFile(FileTxnLog.java:241)
> 	at org.apache.zookeeper.server.persistence.FileTxnLog.append(FileTxnLog.java:219)
> 	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.append(FileTxnSnapLog.java:314)
> 	at org.apache.zookeeper.server.ZKDatabase.append(ZKDatabase.java:470)
> 	at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:140)
> 2017-01-19 16:40:58,956 INFO  [SyncThread:1] o.a.z.s.ZooKeeperServerListenerImpl - Thread
SyncThread:1 exits, error code 1
> 2017-01-19 16:40:58,956 INFO  [SyncThread:1] o.a.z.s.SyncRequestProcessor - SyncRequestProcessor
exited!
> 2017-01-19 16:40:58,957 INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:7181] o.a.z.s.q.Learner
- shutdown called
> java.lang.Exception: shutdown Follower
> 	at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:164)
> 	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> {quote}
> And, leader s3 keeps printing log like this:
> {quote}
> 2017-01-19 16:30:50,452 INFO  [LearnerHandler-/192.168.40.51:35949] o.a.z.s.q.LearnerHandler
- Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@95258f0
> 2017-01-19 16:30:50,452 INFO  [LearnerHandler-/192.168.40.51:35949] o.a.z.s.q.LearnerHandler
- Synchronizing with Follower sid: 1 maxCommittedLog=0x320001019e minCommittedLog=0x320000ffaa
peerLastZxid=0x2300000000
> 2017-01-19 16:30:50,453 WARN  [LearnerHandler-/192.168.40.51:35949] o.a.z.s.q.LearnerHandler
- Unhandled proposal scenario
> 2017-01-19 16:30:50,453 INFO  [LearnerHandler-/192.168.40.51:35949] o.a.z.s.q.LearnerHandler
- Sending SNAP
> 2017-01-19 16:30:50,453 INFO  [LearnerHandler-/192.168.40.51:35949] o.a.z.s.q.LearnerHandler
- Sending snapshot last zxid of peer is 0x2300000000  zxid of leader is 0x320001019esent zxid
of db as 0x320001019e
> 2017-01-19 16:30:50,461 INFO  [LearnerHandler-/192.168.40.51:35949] o.a.z.s.q.LearnerHandler
- Received NEWLEADER-ACK message from 1
> 2017-01-19 16:30:51,738 ERROR [LearnerHandler-/192.168.40.51:35934] o.a.z.s.q.LearnerHandler
- Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
> 	at java.net.SocketInputStream.socketRead0(Native Method)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:152)
> 	at java.net.SocketInputStream.read(SocketInputStream.java:122)
> 	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> 	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> 	at java.io.DataInputStream.readInt(DataInputStream.java:387)
> 	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> 	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> 	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
> 	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:542)
> {quote}
> we execute netstat, found lots of close wait socket in s2,  and never closed.
> {quote}
> tcp6   10865      0 192.168.40.51:47181     192.168.40.57:7288      CLOSE_WAIT  2217/java
          
> tcp6    2576      0 192.168.40.51:57181     192.168.40.57:7288      CLOSE_WAIT  2217/java
          
> {quote}
> seems that s2 has a connection leak.
> after restart zk process of s2, it works fine.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

Mime
View raw message