zookeeper-user mailing list archives

From Raúl Gutiérrez Segalés <...@itevenworks.net>
Subject Re: entire cluster dies with EOFException
Date Sun, 06 Jul 2014 06:55:36 GMT
What's the total size of the data in your ZK cluster? You can check it with:

$ echo mntr | nc localhost 2181 | grep zk_approximate_data_size

And/or the size of the snapshot?
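
If it helps to collect both numbers on each node, here is a rough sketch of two small shell helpers. The function names `zk_data_size` and `zk_latest_snapshot_size` are made up for illustration, and the default data directory is only assumed from the log lines quoted below (`/var/lib/zookeeper/version-2`), so adjust it to your setup.

```shell
# Hypothetical helper: print the approximate data size (in bytes)
# from `mntr` output read on stdin.
zk_data_size() {
  awk '$1 == "zk_approximate_data_size" { print $2 }'
}

# Hypothetical helper: print "<bytes><TAB><path>" for the most recently
# modified snapshot file in a data directory.
# The default directory is an assumption taken from the quoted logs.
zk_latest_snapshot_size() {
  zk_dir="${1:-/var/lib/zookeeper/version-2}"
  zk_latest=$(ls -t "$zk_dir"/snapshot.* 2>/dev/null | head -n 1)
  if [ -z "$zk_latest" ]; then
    echo "no snapshot found in $zk_dir" >&2
    return 1
  fi
  # wc -c reports the file size in bytes; tr strips any padding spaces.
  printf '%s\t%s\n' "$(wc -c < "$zk_latest" | tr -d ' ')" "$zk_latest"
}
```

On each node you would then run something like `echo mntr | nc localhost 2181 | zk_data_size` and `zk_latest_snapshot_size /var/lib/zookeeper/version-2`, and compare the numbers across the ensemble.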


-rgs


On 4 July 2014 06:29, Aaron Zimmerman <azimmerman@sproutsocial.com> wrote:

> Hi all,
>
> We have a 5-node ZooKeeper cluster that has been operating normally for
> several months.  Starting a few days ago, the entire cluster crashes a few
> times per day, all nodes at the exact same time.  We can't track down the
> exact issue, but deleting the snapshots and logs and restarting resolves it.
>
> We are running Exhibitor to monitor the cluster.
>
> It appears that something bad gets into the logs, causing an EOFException
> and this cascades through the entire cluster:
>
> 2014-07-04 12:55:26,328 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> 2014-07-04 12:55:26,328 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
>
>
> Then the server dies, Exhibitor tries to restart each node, and they all
> get stuck trying to replay the bad transaction, logging things like:
>
>
> 2014-07-04 12:58:52,734 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300000021
> 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300000021
> 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
> 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
> 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
> 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
>
> And the cluster is dead.  The only way we have found to recover is to
> delete all of the data and restart.
>
> Anyone seen this before?  Any ideas how I can track down what is causing
> the EOFException, or insulate zookeeper from completely crashing?
>
> Thanks,
>
> Aaron Zimmerman
>
