hadoop-zookeeper-user mailing list archives

From Benjamin Reed <br...@yahoo-inc.com>
Subject Re: cluster fails to start - broken snapshot?
Date Thu, 18 Mar 2010 17:57:42 GMT
we have updated ZOOKEEPER-713 with much more detail, but the bottom line 
is that the Invalid snapshot was caused by an OutOfMemoryError. this 
turns out not to be a problem since we recover using an older snapshot. 
there are other things that are happening that are the real causes of 
the problem. see the jira for details.

thanx
ben

On 03/18/2010 09:16 AM, Łukasz Osipiuk wrote:
> Hi guys,
>
> Today we experienced another problem with our ZooKeeper installation.
> Because of the large attachments I created a JIRA issue for it, even
> though it is more of a question than a bug report.
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-713
>
> Description below:
>
> Today we had a major failure in our production environment. The machines
> in the ZooKeeper cluster went wild and all clients got disconnected.
> We tried to restart the whole ZooKeeper cluster, but it got stuck in
> the leader election phase.
>
> Calling the stat command on any machine in the cluster resulted in a
> 'ZooKeeperServer not running' message.
> In one of the logs I noticed an 'Invalid snapshot' message, which
> disturbed me a bit.
>
> We did not manage to get the cluster working again with its data. We
> deleted all version-2 directories on all nodes, and then the cluster
> started up without problems.
> Is it possible that the snapshot/log data got corrupted in a way that
> made the cluster unable to start?
> Fortunately we could rebuild the data we store in ZooKeeper, as we use
> it only for locks and most of the nodes are ephemeral.
>
> I am attaching the contents of the version-2 directory from all nodes
> and the server logs.
> The original problem occurred some time before 15:00. The first cluster
> restart happened at 15:03.
> At some point later we experimented with deleting the version-2
> directory, so I would not look at the subsequent restarts because they
> can be misleading due to our actions.
>
> I am also attaching zoo.cfg. Maybe something is wrong there.
> Now that I look into the logs, I see a read timeout during the
> initialization phase after 20 seconds (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or the other. Which one? Are
> there any downsides to increasing tickTime?
>
> Best regards, Łukasz Osipiuk
>
> PS. Due to the attachment size limit I used split. To untar, use
> cat nodeX-version-2.tgz-* |tar -xz
>
>    
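
For reference, the 20-second read timeout mentioned in the quoted message follows from initLimit being counted in ticks: the window a follower gets to connect to and sync with the leader is initLimit * tickTime milliseconds. Below is a minimal sketch of that arithmetic. The helper names (parse_cfg, init_timeout_ms) and the sample text are hypothetical; only tickTime=2000 and initLimit=10 come from the message, and the rest of the real zoo.cfg is not reproduced here.

```python
def parse_cfg(text):
    """Parse key=value lines from zoo.cfg-style text, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def init_timeout_ms(settings):
    """Window (in ms) a follower has to connect to and sync with the leader."""
    return int(settings["tickTime"]) * int(settings["initLimit"])


# Assumption: only these two values are known from the quoted message.
SAMPLE_CFG = """
# tickTime is in milliseconds; initLimit is a number of ticks
tickTime=2000
initLimit=10
"""

if __name__ == "__main__":
    cfg = parse_cfg(SAMPLE_CFG)
    print(init_timeout_ms(cfg) / 1000, "seconds")
```

Under this reading, raising initLimit lengthens only the initialization window, while raising tickTime also stretches everything else measured in ticks (syncLimit, for example), so initLimit is probably the narrower knob to try first. Whether a longer window helps at all depends on the underlying causes described in the JIRA.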

