zookeeper-user mailing list archives

From: Flavio Junqueira <fpjunque...@yahoo.com.INVALID>
Subject: Re: entire cluster dies with EOFException
Date: Sun, 06 Jul 2014 15:03:31 GMT
That's interesting; I came across a similar case last week, but the snapshot size reported
was 4 GB, not 4 MB like here (we report the approximate data size in bytes, right?). I thought
it was the snapshot size causing problems and recommended increasing the initLimit value,
but maybe the snapshot size reported was wrong and the problem is similar to this one. The
use case was Storm as well...
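
(As an aside, a minimal sketch of the kind of tuning meant above; the values are
illustrative, not a recommendation. initLimit and syncLimit are set in zoo.cfg and are
measured in ticks, so with the default tickTime of 2000 ms the lines below give followers
roughly 60 seconds to pull and load a large snapshot when joining the quorum.)

    # zoo.cfg -- illustrative values only
    tickTime=2000     # one tick = 2000 ms
    initLimit=30      # ~60 s for a follower to connect and sync with the leader
    syncLimit=10      # ~20 s a follower may lag behind the leader before being dropped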

-Flavio

On 06 Jul 2014, at 12:48, Aaron Zimmerman <azimmerman@sproutsocial.com> wrote:

> Raúl,
> 
> zk_approximate_data_size 4899392
> 
> That is about the size of the snapshots also.
> 
> Benjamin,
> 
> We are not running out of disk space.
> But the log.XXXX files are quite large; is this normal?  In less than 3
> hours, the log file since the last snapshot has grown to 8.2G, and the older
> log files are as large as 12G.
> 
> We are using Storm Trident, which uses ZooKeeper pretty heavily for tracking
> transactional state, but I'm not sure whether that could account for this much
> storage.  Is there an easy way to track which znodes are being updated most
> frequently?
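
(As an aside, one rough way to answer that last question: ZooKeeper 3.4 ships
org.apache.zookeeper.server.LogFormatter, which dumps a transaction log in readable form,
so the output can be piped through a quick path count. The classpath, log file name, and
the grep pattern below are assumptions; adjust them to the local install and to whatever
LogFormatter actually prints.)

    # Sketch: dump one txn log and count the most frequently written paths
    $ cd /usr/lib/zookeeper   # wherever zookeeper.jar and its lib/ directory live
    $ java -cp zookeeper.jar:lib/* org.apache.zookeeper.server.LogFormatter \
          /var/lib/zookeeper/version-2/log.300000021 \
        | grep -o "'/[^,']*" | sort | uniq -c | sort -rn | head -20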
> 
> Thanks,
> 
> Aaron
> 
> 
> 
> 
> 
> On Sun, Jul 6, 2014 at 1:55 AM, Raúl Gutiérrez Segalés <rgs@itevenworks.net>
> wrote:
> 
>> What's the total size of the data in your ZK cluster? i.e.:
>> 
>> $ echo mntr | nc localhost 2181 | grep zk_approximate_data_size
>> 
>> And/or the size of the snapshot?
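
(For anyone reading along: the snapshot and transaction log sizes can also be checked
directly on disk; the dataDir below is assumed from the paths in the logs quoted further
down.)

    $ ls -lh /var/lib/zookeeper/version-2/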
>> 
>> 
>> -rgs
>> 
>> 
>> On 4 July 2014 06:29, Aaron Zimmerman <azimmerman@sproutsocial.com> wrote:
>> 
>>> Hi all,
>>> 
>>> We have a 5-node ZooKeeper cluster that has been operating normally for
>>> several months.  Starting a few days ago, the entire cluster crashes a few
>>> times per day, all nodes at the exact same time.  We can't track down the
>>> exact issue, but deleting the snapshots and logs and restarting resolves it.
>>> 
>>> We are running exhibitor to monitor the cluster.
>>> 
>>> It appears that something bad gets into the logs, causing an EOFException
>>> that cascades through the entire cluster:
>>> 
>>> 2014-07-04 12:55:26,328 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
>>> java.io.EOFException
>>>         at java.io.DataInputStream.readInt(DataInputStream.java:375)
>>>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>>>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
>>>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
>>> 2014-07-04 12:55:26,328 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>> java.lang.Exception: shutdown Follower
>>>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
>>> 
>>> 
>>> Then the server dies, Exhibitor tries to restart each node, and they all
>>> get stuck trying to replay the bad transaction, logging things like:
>>> 
>>> 
>>> 2014-07-04 12:58:52,734 [myid:1] - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
>>> 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300000021
>>> 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300000021
>>> 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
>>> 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] - Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
>>> 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] - Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
>>> 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
>>> 
>>> And the cluster is dead.  The only way we have found to recover is to
>>> delete all of the data and restart.
>>> 
>>> Has anyone seen this before?  Any ideas how I can track down what is causing
>>> the EOFException, or insulate ZooKeeper from completely crashing?
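
(Again as an aside, one way to narrow down where the replay breaks, sketched under the
same assumptions as the LogFormatter example above: dump the file named in the
EOFException and look at the last records it manages to print before failing.)

    # Sketch: find the last readable record in the suspect txn log
    $ java -cp zookeeper.jar:lib/* org.apache.zookeeper.server.LogFormatter \
          /var/lib/zookeeper/version-2/log.300011fc2 | tail -5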
>>> 
>>> Thanks,
>>> 
>>> Aaron Zimmerman
>>> 
>> 

