hadoop-zookeeper-dev mailing list archives

From "Alexandre Hardy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader
Date Thu, 04 Nov 2010 06:33:42 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928137#action_12928137 ]

Alexandre Hardy commented on ZOOKEEPER-917:
-------------------------------------------

The excerpts are extracted from {{hbase-0.20/hbase*.log}}, so the information should be readily
available.
The tar file contents should be as follows:
{noformat}
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.10/
drwxr-xr-x ah/users          0 2010-11-03 13:33 192.168.130.10/hbase-0.20/
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.out
-rw-r--r-- ah/users   62922921 2010-11-02 14:42 192.168.130.10/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-d3.log
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.12/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.12/hbase-0.20/
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.13/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.13/hbase-0.20/
-rw-r--r-- ah/users   65903411 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.log
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.13/hbase-0.20/hbase--zookeeper-e0-cb-4e-65-4d-4e.out
drwxr-xr-x ah/users          0 2010-11-02 14:42 192.168.130.14/
drwxr-xr-x ah/users          0 2010-11-03 13:27 192.168.130.14/hbase-0.20/
-rw-r--r-- ah/users          0 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.out
-rw-r--r-- ah/users   62835121 2010-11-02 14:42 192.168.130.14/hbase-0.20/hbase--zookeeper-e0-cb-4e-71-8-a8.log
{noformat}

The only logs that are missing are those for .11, but that should not influence the analysis
of the leader election (I hope).

We are using monitoring software that detects when a zookeeper instance is no longer reachable
and automatically starts a fresh zookeeper instance as a replacement. This software can detect
the failure and start the new instance fairly rapidly. Would it be better to delay
the start of a fresh zookeeper instance to allow the existing instances to elect a new leader?
If so, do you have any guidelines regarding this delay? (We are considering this approach,
but would like to avoid it if possible.)

{quote}
In your case, I'm still not sure why it happens because the initial zxid of node 1 is 4294967742
according to your excerpt. 
{quote}
That is indeed the key question that I am trying to find an answer for! :-)
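
For reference, the zxid quoted above can be decoded by hand. The sketch below assumes the usual
ZooKeeper layout of a zxid (epoch in the high 32 bits, counter in the low 32 bits); the helper
name {{split_zxid}} is made up for illustration and is not part of ZooKeeper.

```python
# Minimal sketch: split a zxid into (epoch, counter).
# Assumption: the zxid is a 64-bit value with the epoch in the high 32 bits
# and a per-epoch counter in the low 32 bits.
def split_zxid(zxid):
    epoch = zxid >> 32            # high 32 bits
    counter = zxid & 0xFFFFFFFF   # low 32 bits
    return epoch, counter


epoch, counter = split_zxid(4294967742)
print("epoch=%d counter=%d hex=0x%x" % (epoch, counter, 4294967742))
# prints: epoch=1 counter=446 hex=0x1000001be
```

Under that assumption, 4294967742 corresponds to epoch 1, counter 446, which is clearly not the
zxid of a node that has never seen any transactions.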

> Leader election selected incorrect leader
> -----------------------------------------
>
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>         Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries)
> Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>         Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
>   * 192.168.130.10
>   * 192.168.130.11
>   * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup).
> The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11
> was permanently removed from service and could not contribute to the quorum any further (powered
> off).
> DNS entries were updated for the new node to allow all the zookeeper servers to find
> the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had
> not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no
> attempt has been made to reproduce this problem as yet.
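
As background to the report above, here is a rough sketch of why a fresh node winning the
election is unexpected. It assumes votes are ordered by (zxid, server id), as in the fast leader
election of the 3.x releases of this period: the higher zxid wins, and the server id only breaks
ties. The function name and the server ids are hypothetical; the zxid is the one quoted in the
comment, and 0 stands in for a node with no history.

```python
# Minimal sketch of the assumed vote ordering: a proposed vote replaces the
# current one only if its zxid is higher, or the zxids are equal and its
# server id is higher.
def vote_wins(new_id, new_zxid, cur_id, cur_zxid):
    return new_zxid > cur_zxid or (new_zxid == cur_zxid and new_id > cur_id)


# Hypothetical ids: server 3 is the fresh node, server 1 an existing node.
fresh = (3, 0)              # (server id, zxid): no transactions seen
existing = (1, 4294967742)  # zxid taken from the quoted excerpt

print(vote_wins(fresh[0], fresh[1], existing[0], existing[1]))
# prints: False
```

With this ordering the fresh node's vote should lose to any node that has a non-zero zxid,
regardless of server id, which is what makes the observed outcome look like a bug.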

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

