zookeeper-user mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: cluster fails to start - broken snapshot?
Date Thu, 18 Mar 2010 17:57:42 GMT
we have updated ZOOKEEPER-713 with much more detail, but the bottom line
is that the Invalid snapshot was caused by an OutOfMemoryError. this
turns out not to be a problem since we recover using an older snapshot.
other things are happening that are the real causes of the problem.
see the jira for details.


On 03/18/2010 09:16 AM, Łukasz Osipiuk wrote:
> Hi guys,
> Today we experienced another problem with our zookeeper installation.
> Due to large attachments I created a jira issue for it, even though it
> is more a question than a bug report.
> https://issues.apache.org/jira/browse/ZOOKEEPER-713
> Description below:
> Today we had a major failure in our production environment. Machines in
> the zookeeper cluster went wild and all clients got disconnected.
> We tried to restart the whole zookeeper cluster but the cluster got stuck in
> the leader election phase.
> Calling the stat command on any machine in the cluster resulted in
> a 'ZooKeeperServer not running' message.
> In one of the logs I noticed an 'Invalid snapshot' message, which disturbed me a bit.
> We did not manage to make the cluster work again with the data. We deleted all
> version-2 directories on all nodes and then the cluster started up without
> problems.
> Is it possible that the snapshot/log data got corrupted in a way which
> made the cluster unable to start?
> Fortunately we could rebuild the data we store in zookeeper, as we use it
> only for locks and most of the nodes are ephemeral.
> I am attaching the contents of the version-2 directory from all nodes and the server logs.
> The original problem occurred some time before 15:00. The first cluster restart
> happened at 15:03.
> At some point later we experimented with deleting the version-2 directory,
> so I would not look at the following restarts because they can be
> misleading due to our actions.
> I am also attaching zoo.cfg. Maybe something is wrong there.
> As I now look into the logs I see a read timeout during the initialization
> phase after 20 secs (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or the other. Which one? Are there
> any downsides to increasing tickTime?
> Best regards, Łukasz Osipiuk
> PS. Due to the attachment size limit I used split. To untar, use
> cat nodeX-version-2.tgz-* |tar -xz
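
The 20 secs read timeout mentioned in the quoted message matches initLimit multiplied by tickTime, since initLimit is counted in ticks and tickTime is the length of one tick in milliseconds. The following is only a minimal sketch of that arithmetic, not ZooKeeper code: the helper names parse_zoo_cfg and init_timeout_ms and the SAMPLE_CFG string are hypothetical, and the sample contains only the two values quoted above, not the poster's actual zoo.cfg.

```python
def parse_zoo_cfg(text):
    """Parse simple key=value lines from a zoo.cfg-style string.

    Hypothetical helper for illustration; blank lines, comment lines
    and lines without '=' are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def init_timeout_ms(settings):
    """Return initLimit * tickTime, the initialization window in milliseconds."""
    return int(settings["tickTime"]) * int(settings["initLimit"])


# Only the two values quoted in the message; everything else is omitted.
SAMPLE_CFG = """
# length of one tick in milliseconds
tickTime=2000
# number of ticks allowed for the initial connect and sync
initLimit=10
"""

settings = parse_zoo_cfg(SAMPLE_CFG)
print(init_timeout_ms(settings))  # 20000 ms, i.e. the 20 secs seen in the logs
```

Under this arithmetic, raising either value lengthens the initialization window, but tickTime is also the base unit for other tick-based settings such as syncLimit, so changing it shifts those timeouts too, while initLimit affects only the initialization phase.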
