zookeeper-user mailing list archives

From Raúl Gutiérrez Segalés <...@itevenworks.net>
Subject Re: entire cluster dies with EOFException
Date Mon, 07 Jul 2014 16:35:39 GMT
On 6 July 2014 14:26, Flavio Junqueira <fpjunqueira@yahoo.com.invalid>
wrote:

> But what is it that was causing problems in your scenario, Raul? Is it
> reading the log? In any case, it sounds like initLimit is the parameter you
> want to change, no?
>

Yeah, I think so. It was just that it took too long to walk through all the
txns (there were too many of them). So finding the sweet spot between
snapshots and transactions is a bit tricky in this case, I think.
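
For reference, a starting point in zoo.cfg might look something like the
sketch below; the values are illustrative, not tuned for this workload
(initLimit and syncLimit are measured in ticks, and snapCount bounds how
many txns accumulate in a log file before a new snapshot is cut):

    # zoo.cfg - illustrative values, adjust to the observed txn rate
    # base time unit, in milliseconds
    tickTime=2000
    # ticks a follower may take to connect to and sync with the leader (60s here)
    initLimit=30
    # ticks a follower may lag behind the leader before being dropped (20s here)
    syncLimit=10
    # snapshot roughly every snapCount txns (default 100000); lowering it keeps
    # individual txn logs shorter, so restarts replay fewer txns
    snapCount=50000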


-rgs




>
> -Flavio
>
> On 06 Jul 2014, at 19:09, Raúl Gutiérrez Segalés <rgs@itevenworks.net>
> wrote:
>
> > Oh, Storm, right. Yeah, I've seen this. The transaction rate is so huge
> > that the initial sync fails... perhaps you could try bigger tickTime,
> > initLimit and syncLimit params.
> >
> >
> > -rgs
> >
> >
> > On 6 July 2014 04:48, Aaron Zimmerman <azimmerman@sproutsocial.com> wrote:
> >
> >> Raúl,
> >>
> >> zk_approximate_data_size 4899392
> >>
> >> That is about the size of the snapshots also.
> >>
> >> Benjamin,
> >>
> >> We are not running out of disk space.
> >> But the log.XXXX files are quite large, is this normal?  In less than 3
> >> hours, the log file since the last snapshot is 8.2G, and the older log
> >> files are as large as 12G.
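
If the worry is old snapshots and logs accumulating on disk, ZooKeeper 3.4
can prune them on its own; a hedged zoo.cfg sketch (these only remove files
superseded by newer snapshots, while the size of the live log is governed by
snapCount, as sketched above):

    # keep only the 3 most recent snapshots and their associated txn logs
    autopurge.snapRetainCount=3
    # run the purge task every hour (0, the default, disables autopurge)
    autopurge.purgeInterval=1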
> >>
> >> We are using Storm Trident, which uses ZooKeeper pretty heavily for
> >> tracking transactional state, but I'm not sure if that could account
> >> for this much storage.  Is there an easy way to track which znodes are
> >> being updated most frequently?
> >>
> >> Thanks,
> >>
> >> Aaron
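
One rough way to find the hottest znodes, assuming the stock ZooKeeper jars
and running from the install directory (jar names and output format vary by
version, so treat this as a sketch): pretty-print a txn log with the
LogFormatter class that ships with ZooKeeper, pull out the paths, and count:

    $ java -cp "zookeeper.jar:lib/*" org.apache.zookeeper.server.LogFormatter \
          /var/lib/zookeeper/version-2/log.300000021 \
        | grep -o "'/[^,]*" | sort | uniq -c | sort -rn | head -20

Each LogFormatter line describes one txn (time, session, cxid, zxid, the
operation, and usually the znode path), so the top of that list is a decent
proxy for the most frequently updated znodes.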
> >>
> >>
> >>
> >>
> >>
> >> On Sun, Jul 6, 2014 at 1:55 AM, Raúl Gutiérrez Segalés <rgs@itevenworks.net> wrote:
> >>
> >>> What's the total size of the data in your ZK cluster? i.e.:
> >>>
> >>> $ echo mntr | nc localhost 2181 | grep zk_approximate_data_size
> >>>
> >>> And/or the size of the snapshot?
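
FWIW, the snapshot sizes can be read straight off the data directory; the
path here is assumed from the log excerpts later in this thread:

    $ ls -lh /var/lib/zookeeper/version-2/snapshot.*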
> >>>
> >>>
> >>> -rgs
> >>>
> >>>
> >>> On 4 July 2014 06:29, Aaron Zimmerman <azimmerman@sproutsocial.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> We have a 5-node ZooKeeper cluster that has been operating normally
> >>>> for several months.  Starting a few days ago, the entire cluster
> >>>> crashes a few times per day, all nodes at the exact same time.  We
> >>>> can't track down the exact issue, but deleting the snapshots and logs
> >>>> and restarting resolves it.
> >>>>
> >>>> We are running Exhibitor to monitor the cluster.
> >>>>
> >>>> It appears that something bad gets into the logs, causing an
> >>>> EOFException that cascades through the entire cluster:
> >>>>
> >>>> 2014-07-04 12:55:26,328 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> >>>> java.io.EOFException
> >>>>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>>>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>>>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>>>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> >>>>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> >>>>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> >>>> 2014-07-04 12:55:26,328 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> >>>> java.lang.Exception: shutdown Follower
> >>>>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> >>>>
> >>>>
> >>>> Then the server dies, Exhibitor tries to restart each node, and they
> >>>> all get stuck trying to replay the bad transaction, logging things
> >>>> like:
> >>>>
> >>>>
> >>>> 2014-07-04 12:58:52,734 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> >>>> 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300000021
> >>>> 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300000021
> >>>> 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
> >>>> 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
> >>>> 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
> >>>> 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
> >>>>
> >>>> And the cluster is dead.  The only way we have found to recover is to
> >>>> delete all of the data and restart.
> >>>>
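
For what it's worth, a minimal sketch of that recovery, assuming the default
dataDir layout from the logs above and that enough nodes stay up to form a
quorum the wiped node can re-sync from (init scripts are illustrative; keep
the old files for a post-mortem instead of deleting them outright):

    # on each stuck node, one at a time
    $ /etc/init.d/zookeeper stop         # or let Exhibitor stop it
    $ mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/version-2.bad
    $ mkdir /var/lib/zookeeper/version-2
    $ /etc/init.d/zookeeper start        # pulls a fresh snapshot from the leader

The myid file lives in the dataDir root rather than in version-2/, so it
survives the move.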
> >>>> Anyone seen this before?  Any ideas how I can track down what is
> >>>> causing the EOFException, or insulate ZooKeeper from completely
> >>>> crashing?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Aaron Zimmerman
> >>>>
> >>>
> >>
>
>
