zookeeper-dev mailing list archives

From "Benjamin Reed (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-713) zookeeper fails to start - broken snapshot?
Date Thu, 18 Mar 2010 16:24:27 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12846976#action_12846976
] 

Benjamin Reed commented on ZOOKEEPER-713:
-----------------------------------------

Hey, just for a sanity check: you aren't running out of disk space, are you?

> zookeeper fails to start - broken snapshot?
> -------------------------------------------
>
>                 Key: ZOOKEEPER-713
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-713
>             Project: Zookeeper
>          Issue Type: Bug
>    Affects Versions: 3.2.2
>         Environment: debian lenny; ia64; xen virtualization
>            Reporter: Lukasz Osipiuk
>         Attachments: node1-version-2.tgz-aa, node1-version-2.tgz-ab, node1-zookeeper.log.gz,
>                      node2-version-2.tgz-aa, node2-version-2.tgz-ab, node2-version-2.tgz-ac, node2-zookeeper.log.gz,
>                      node3-version-2.tgz-aa, node3-version-2.tgz-ab, node3-version-2.tgz-ac, node3-zookeeper.log.gz,
>                      zoo.cfg
>
>
> Hi guys,
> The following is not a bug report but rather a question, but as I am attaching large files I am posting it here rather than on the mailing list.
> Today we had a major failure in our production environment. The machines in the zookeeper cluster went wild and all clients got disconnected.
> We tried to restart the whole zookeeper cluster, but the cluster got stuck in the leader election phase.
> Calling the stat command on any machine in the cluster resulted in a 'ZooKeeperServer not running' message.
> In one of the logs I noticed an 'Invalid snapshot' message, which disturbed me a bit.
> We did not manage to make the cluster work again with the data. We deleted all version-2 directories on all nodes and then the cluster started up without problems.
> Is it possible that the snapshot/log data got corrupted in a way which made the cluster unable to start?
> Fortunately we could rebuild the data we store in zookeeper, as we use it only for locks and most of the nodes are ephemeral.
> I am attaching the contents of the version-2 directory from all nodes and the server logs.
> The source problem occurred some time before 15:00. The first cluster restart happened at 15:03.
> At some point later we experimented with deleting the version-2 directory, so I would not look at the subsequent restarts because they can be misleading due to our actions.
> I am also attaching zoo.cfg. Maybe something is wrong there.
> As I now look into the logs, I see a read timeout during the initialization phase after 20 secs (initLimit=10, tickTime=2000).
> Maybe all I have to do is increase one or the other. Which one? Are there any downsides to increasing tickTime?
> Best regards, Łukasz Osipiuk
> PS. Due to the attachment size limit I used split. To untar, use:
> cat nodeX-version-2.tgz-* |tar -xz
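
For reference, the 20-second read timeout mentioned in the report matches initLimit multiplied by tickTime. The snippet below is only a sketch of that arithmetic, assuming the usual zoo.cfg semantics in which initLimit is counted in ticks and tickTime is given in milliseconds; the function name init_timeout_ms is made up for illustration and is not part of ZooKeeper.

```python
def init_timeout_ms(init_limit: int, tick_time_ms: int) -> int:
    """Illustrative helper (not a ZooKeeper API): time, in milliseconds,
    that a follower is allowed for connecting and syncing to the leader,
    assuming initLimit is expressed in ticks of tickTime milliseconds."""
    return init_limit * tick_time_ms


# Values quoted in the report: initLimit=10, tickTime=2000
print(init_timeout_ms(10, 2000) / 1000, "seconds")  # 20.0 seconds
```

Under the same assumption, raising initLimit lengthens only this initialization window, whereas raising tickTime scales every limit that is expressed in ticks.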

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

