zookeeper-user mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: entire cluster dies with EOFException
Date Fri, 04 Jul 2014 21:02:37 GMT
I’ve seen EOF errors when ZooKeeper's default 1 MB buffer limit (jute.maxbuffer) has been
reached. Check whether any ZNodes have thousands of children and/or big payloads.
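
A rough sketch of such a check, in Python. It walks a tree and flags znodes whose child list or payload comes close to the limit. The names find_large_znodes, get_children, and get_data are hypothetical and not part of any ZooKeeper API; the two callables stand in for whatever client you use. The size estimate assumes jute's encoding of a string list (a 4-byte count, then a 4-byte length plus the UTF-8 bytes for each name) and ignores response headers, so treat it as a lower bound.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Default value of jute.maxbuffer (just under 1 MB).
JUTE_MAXBUFFER_DEFAULT = 0xFFFFF


def children_list_bytes(names: List[str]) -> int:
    """Approximate serialized size of a child-name list.

    Assumes a 4-byte element count, then per name a 4-byte length
    prefix followed by the UTF-8 bytes. Headers are not counted.
    """
    return 4 + sum(4 + len(name.encode("utf-8")) for name in names)


def find_large_znodes(
    get_children: Callable[[str], List[str]],      # hypothetical client hook
    get_data: Callable[[str], Optional[bytes]],    # hypothetical client hook
    root: str = "/",
    limit: int = JUTE_MAXBUFFER_DEFAULT,
    warn_ratio: float = 0.5,
) -> Iterator[Tuple[str, int, int, int]]:
    """Yield (path, child_count, child_list_bytes, data_bytes) for every
    znode whose child list or payload reaches warn_ratio * limit."""
    threshold = limit * warn_ratio
    stack = [root]
    while stack:
        path = stack.pop()
        names = get_children(path)
        data = get_data(path) or b""
        listing = children_list_bytes(names)
        if max(listing, len(data)) >= threshold:
            yield path, len(names), listing, len(data)
        # Iterative depth-first walk avoids recursion limits on deep trees.
        for name in names:
            stack.append(path.rstrip("/") + "/" + name)
```

With a client such as kazoo, one would presumably pass zk.get_children and lambda p: zk.get(p)[0] as the two callables; that wiring is an assumption and is not tested here.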

-JZ


From: Aaron Zimmerman azimmerman@sproutsocial.com
Reply: user@zookeeper.apache.org user@zookeeper.apache.org
Date: July 4, 2014 at 8:30:09 AM
To: user@zookeeper.apache.org user@zookeeper.apache.org
Subject:  entire cluster dies with EOFException  

Hi all,  

We have a 5-node ZooKeeper cluster that has been operating normally for
several months. Starting a few days ago, the entire cluster crashes a few
times per day, with all nodes going down at exactly the same time. We can't
track down the cause, but deleting the snapshots and logs and restarting
resolves it.

We are running Exhibitor to monitor the cluster.

It appears that something bad gets into the logs, causing an EOFException  
and this cascades through the entire cluster:  

2014-07-04 12:55:26,328 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
    at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
    at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
2014-07-04 12:55:26,328 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
    at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
    at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)


Then the server dies, Exhibitor tries to restart each node, and they all
get stuck trying to replay the bad transaction, logging things like:


2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300000021
2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300000021
2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2

And the cluster is dead. The only way we have found to recover is to  
delete all of the data and restart.  

Anyone seen this before? Any ideas how I can track down what is causing  
the EOFException, or insulate zookeeper from completely crashing?  
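
One possible starting point for narrowing this down: the suffix of each snapshot.* and log.* file name is a hexadecimal zxid, whose upper 32 bits are the leader epoch and whose lower 32 bits are a counter within that epoch. The Python sketch below pulls the unreadable file paths out of log output like the above and decodes their zxids. The function names are invented for illustration, and the regular expression assumes the "Failed to read <path>" wording seen in these logs.

```python
import re
from typing import List, Tuple

# Matches the message format seen in the FileTxnIterator debug lines.
FAILED_READ = re.compile(r"Failed to read (\S+)")


def failed_txn_logs(log_text: str) -> List[str]:
    """Return the distinct file paths reported as unreadable, in order."""
    seen: List[str] = []
    for match in FAILED_READ.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


def zxid_from_filename(path: str) -> int:
    """Parse the hexadecimal zxid suffix of a snapshot.* or log.* file."""
    suffix = path.rsplit(".", 1)[1]
    return int(suffix, 16)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF
```

For log.300000021 this gives epoch 3 and counter 33, which places the file at the very start of that epoch. The transactions written shortly before each crash are the natural ones to inspect.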

Thanks,  

Aaron Zimmerman  
