zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Benjamin Reed <br...@apache.org>
Subject Re: entire cluster dies with EOFException
Date Sun, 06 Jul 2014 06:49:52 GMT
On Fri, Jul 4, 2014 at 10:35 PM, Aaron Zimmerman <
azimmerman@sproutsocial.com> wrote:

> Thanks for getting back to me.
>
> Jordan,
>
> I don't think we are doing any large nodes or thousands of children.  We
> are using zookeeper for storm and service discovery, so things are pretty
> modest.
>
> Camille,
>
> I've created https://issues.apache.org/jira/browse/ZOOKEEPER-1955, and
> attached the snapshot causing the EOF exception (I think..?), let me know
> if you can discover anything from the snapshot.
>
> Thanks,
>
> Aaron Zimmerman
>
>
>
> On Fri, Jul 4, 2014 at 4:02 PM, Jordan Zimmerman <
> jordan@jordanzimmerman.com
> > wrote:
>
> > I’ve seen EOF errors when the 1MB limit has been reached. Check to see if
> > any ZNodes have thousands of children and/or big payloads.
> >
> > -JZ
> >
> >
> > From: Aaron Zimmerman azimmerman@sproutsocial.com
> > Reply: user@zookeeper.apache.org user@zookeeper.apache.org
> > Date: July 4, 2014 at 8:30:09 AM
> > To: user@zookeeper.apache.org user@zookeeper.apache.org
> > Subject:  entire cluster dies with EOFException
> >
> > Hi all,
> >
> > We have a 5 node zookeeper cluster that has been operating normally for
> > several months. Starting a few days ago, the entire cluster crashes a few
> > times per day, all nodes at the exact same time. We can't track down the
> > exact issue, but deleting the snapshots and logs and restarting resolves.
> >
> > We are running exhibitor to monitor the cluster.
> >
> > It appears that something bad gets into the logs, causing an EOFException
> > and this cascades through the entire cluster:
> >
> > 2014-07-04 12:55:26,328 [myid:1] - WARN
> > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > following the leader
> > java.io.EOFException
> > at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > at
> > org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > at
> >
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >
> > at
> >
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> > at
> > org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> > at
> >
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> > at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> > 2014-07-04 12:55:26,328 [myid:1] - INFO
> > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> > java.lang.Exception: shutdown Follower
> > at
> > org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> > at
> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> >
> >
> > Then the server dies, exhibitor tries to restart each node, and they all
> > get stuck trying to replay the bad transaction, logging things like:
> >
> >
> > 2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading
> > snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> > 2014-07-04 12:58:52,896 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@575] - Created new input stream
> > /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:58:52,915 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@578] - Created new input archive
> > /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:59:25,870 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@618] - EOF excepton
> > java.io.EOFException:
> > Failed to read /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:59:25,871 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@575] - Created new input stream
> > /var/lib/zookeeper/version-2/log.300011fc2
> > 2014-07-04 12:59:25,872 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@578] - Created new input archive
> > /var/lib/zookeeper/version-2/log.300011fc2
> > 2014-07-04 12:59:48,722 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@618] - EOF excepton
> > java.io.EOFException:
> > Failed to read /var/lib/zookeeper/version-2/log.300011fc2
> >
> > And the cluster is dead. The only way we have found to recover is to
> > delete all of the data and restart.
> >
> > Anyone seen this before? Any ideas how I can track down what is causing
> > the EOFException, or insulate zookeeper from completely crashing?
> >
> > Thanks,
> >
> > Aaron Zimmerman
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message