hadoop-zookeeper-dev mailing list archives

From "Alexandre Hardy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader
Date Thu, 04 Nov 2010 10:33:42 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928174#action_12928174 ]

Alexandre Hardy commented on ZOOKEEPER-917:
-------------------------------------------

Hi Flavio,

At first pass this seems to indicate that we can't replace a failed ZooKeeper server with a
new one, but that statement is probably far too strong. If I understand correctly, you are
saying that the server can be replaced only after a new leader has been elected, i.e. any
fresh server should only be restarted once the quorum has been reestablished?

I'm not sure I understand exactly why the election went wrong. Were the old election messages
resent once the fresh server became reachable? I would have thought that election messages
should be based on the current state and never carry old state.

This will take some time to digest and think through properly. In the meantime, can you suggest
how we should deal with this situation? Can we simply wait for the two remaining nodes to
establish a quorum and then reintroduce the third node? I suppose we could test whether a
quorum has been established by checking whether we can establish a new ZooKeeper session.
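
To make the session-based check concrete, here is a rough sketch in Python. It only
shows the polling logic. The function name wait_for_quorum and the try_session callable
are hypothetical and not part of ZooKeeper: try_session stands in for whatever client
call opens a fresh session against the two remaining nodes, closes it again, and
reports success. A successful session only tells us that the server we reached is
currently serving as part of a quorum, so this is a heuristic rather than a guarantee.

```python
import time


def wait_for_quorum(try_session, timeout_s=60.0, interval_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until a new session can be established or the timeout expires.

    try_session is a hypothetical caller-supplied callable: it should try to
    open a fresh ZooKeeper session, close it again, and return True on
    success or False (or raise OSError) on failure.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if try_session():
                return True  # a session was established: quorum looks healthy
        except OSError:
            pass  # treat connection errors as "no quorum yet"
        if clock() >= deadline:
            return False  # gave up: do not reintroduce the third node yet
        sleep(interval_s)
```

The replacement node would then be started only after wait_for_quorum(...) returns
True for the surviving servers.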

Thanks for the help


> Leader election selected incorrect leader
> -----------------------------------------
>
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>         Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries)
> Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>
>         Attachments: zklogs-20101102144159SAST.tar.gz
>
>
> We had three nodes running zookeeper:
>   * 192.168.130.10
>   * 192.168.130.11
>   * 192.168.130.14
> 192.168.130.11 failed, and was replaced by a new node 192.168.130.13 (automated startup).
> The new node had not participated in any zookeeper quorum previously. The node 192.168.130.11
> was permanently removed from service and could not contribute to the quorum any further (powered
> off).
> DNS entries were updated for the new node to allow all the zookeeper servers to find
> the new node.
> The new node 192.168.130.13 was selected as the LEADER, despite the fact that it had
> not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no
> attempt has been made to reproduce this problem as yet.
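
The expectation behind the report can be illustrated with a small Python sketch: a
vote from a server that has seen a higher zxid should win, with the server id used
only as a tie-breaker. This is a simplified illustration of that ordering, not
ZooKeeper's actual election code. The Vote type, the preferred function, and the zxid
and id values in the usage note are all made up for the example, since the issue does
not list them.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    zxid: int       # last transaction id the server has seen
    server_id: int  # the server's configured id


def preferred(a: Vote, b: Vote) -> Vote:
    """Return the vote that should win: higher zxid first, then higher id."""
    # NamedTuples compare field by field, so this orders by zxid, then server_id.
    return max(a, b)
```

For example, with a fresh server Vote(zxid=0, server_id=3) and an existing server
Vote(zxid=500, server_id=1), preferred(...) picks the existing server, even though the
fresh server has the higher id.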

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

