hadoop-zookeeper-dev mailing list archives

From "Flavio Junqueira (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-917) Leader election selected incorrect leader
Date Thu, 04 Nov 2010 09:59:45 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928171#action_12928171 ]

Flavio Junqueira commented on ZOOKEEPER-917:

The program I was using to open your logs was hiding some of the messages for some reason
unknown to me. I now understand why the leader was elected in your case and the behavior is
legitimate. Let me try to explain.

We currently repeat the last notification sent to a given server upon reconnecting to it.
This is to avoid problems with partially sent messages, and, assuming no further bugs, the
protocol is resilient to duplicate messages. At the same time, a server A decides to follow
another server B if it receives a message from B saying that B is leading and messages from a
quorum saying that they are following B, even if A is in a later election epoch. This mechanism
is there to avoid A being locked out of the ensemble if it is partitioned away and comes
back later.
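
To make the rule above concrete, here is a minimal sketch of the "follow an established leader" check. It is not ZooKeeper's actual code: the `Notification` fields and the `should_follow` helper are hypothetical names, and counting the leader's own notification toward the quorum is an assumption chosen to match a three-server ensemble.

```python
from collections import namedtuple

# Hypothetical notification shape; field names are illustrative only.
Notification = namedtuple("Notification", ["sender", "state", "leader", "epoch"])


def should_follow(candidate, notifications, ensemble_size):
    """Return True if the received notifications justify following `candidate`.

    Sketch of the rule described in the text: the candidate itself reports
    that it is leading, and a quorum (strict majority) of servers reports
    supporting it. The election epoch is deliberately NOT compared, which is
    what lets a server in a later epoch rejoin an established ensemble.
    """
    quorum = ensemble_size // 2 + 1
    leader_claims = any(
        n.sender == candidate and n.state == "LEADING" and n.leader == candidate
        for n in notifications
    )
    # Assumption: the leader's own notification counts toward the quorum.
    supporters = {
        n.sender
        for n in notifications
        if n.leader == candidate and n.state in ("FOLLOWING", "LEADING")
    }
    return leader_claims and len(supporters) >= quorum
```

Because the epoch field is ignored, a stale "following B" notification that is replayed on reconnect carries the same weight as a fresh one.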

From your logs, what happens is:

# Fresh server 2 receives previous notifications from 0 and 1, and decides to lead;
# Server 1 receives the last message from server 0 saying that it is following 2 (which was
the previous leader), and the notification from 2 saying that it is leading. Server 1 consequently
decides to follow 2;
# Server 0 receives the last message from server 1 saying that it is following 2 (which was
the previous leader), and the notification from 2 saying that it is leading. Server 0 consequently
decides to follow 2.
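
The three steps can be replayed in a few lines. This is only a sketch under simplifying assumptions: notifications are reduced to `(state, leader)` pairs, the `last_sent` table and `votes_for` helper are hypothetical, and the replayed messages from 0 and 1 are taken to name server 2, as the logs suggest.

```python
ENSEMBLE = {0, 1, 2}
QUORUM = len(ENSEMBLE) // 2 + 1  # 2 out of 3

# Last notification each surviving server sent before server 2 was replaced.
# Both were following the old server 2; each value is (state, leader).
last_sent = {0: ("FOLLOWING", 2), 1: ("FOLLOWING", 2)}


def votes_for(leader, received):
    """Count distinct senders whose notification names `leader`."""
    return sum(1 for _state, voted in received.values() if voted == leader)


# Step 1: fresh server 2 receives the replayed notifications from 0 and 1.
seen_by_2 = dict(last_sent)
server2_leads = votes_for(2, seen_by_2) >= QUORUM

# Step 2: server 1 sees server 0's replayed message plus server 2's claim.
seen_by_1 = {0: last_sent[0], 2: ("LEADING", 2)}
server1_follows = (
    seen_by_1[2] == ("LEADING", 2) and votes_for(2, seen_by_1) >= QUORUM
)

# Step 3: server 0 sees server 1's replayed message plus server 2's claim.
seen_by_0 = {1: last_sent[1], 2: ("LEADING", 2)}
server0_follows = (
    seen_by_0[2] == ("LEADING", 2) and votes_for(2, seen_by_0) >= QUORUM
)
```

In this model all three flags end up true, even though the "following 2" messages refer to the old server 2 and not to the fresh one.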

Now the main problem I see is that the followers accept the snapshot from the leader, and
they shouldn't, given that they have moved to a later epoch. I suspect that we currently allow
a server to return to an epoch it was in previously, again to avoid having a server locked
out after being partitioned away and healing, but I need to do some further inspection.

My overall take is that your scenario is unfortunately not one we support: we don't currently
provision for configuration changes. In general, the case you describe constitutes a loss of
quorum, which violates one of our core assumptions. In more detail, a quorum supporting
a leader must have a non-empty intersection with the quorum of servers that have accepted
requests in the previous epoch. Wiping out the state of server 2, by replacing it with a fresh
server, leads to the situation in which just one server contains all transactions accepted
by a quorum (and possibly committed). If you hadn't replaced server 2 with a fresh server,
then either server 2 would have been elected again just the same, and it would be fine because
it was previously the leader, or it wouldn't have been elected because the leader was previously
another server and the last notifications of 0 and 1 would be supporting a different server.
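
The intersection argument can be checked by brute force for a small ensemble. The sketch below is illustrative only: `majorities` and `always_intersect` are hypothetical helpers, quorums are assumed to be simple majorities, and a wiped server is modelled as one whose identity still counts toward a quorum but whose state is gone.

```python
from itertools import combinations


def majorities(servers):
    """Enumerate every subset of `servers` that forms a strict majority."""
    members = sorted(servers)
    need = len(members) // 2 + 1
    return [
        set(combo)
        for size in range(need, len(members) + 1)
        for combo in combinations(members, size)
    ]


def always_intersect(servers, wiped=frozenset()):
    """Check the intersection property over all pairs of majority quorums.

    Returns True if every quorum that could support a new leader shares at
    least one server that still holds its state with every quorum that could
    have accepted requests in the previous epoch.
    """
    quorums = majorities(servers)
    return all((new & old) - set(wiped) for new in quorums for old in quorums)
```

For example, `always_intersect({0, 1, 2})` holds, while `always_intersect({0, 1, 2}, wiped={2})` does not: the quorums {0, 2} and {1, 2} overlap only in server 2, which no longer has the transactions it accepted.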

On reconfigurations, we have talked about it (http://wiki.apache.org/hadoop/ZooKeeper/ClusterMembership),
but we haven't made enough progress recently and it is currently not implemented. It would
be great to get some help here.

Let me know if this analysis makes any sense to you, please.

> Leader election selected incorrect leader
> -----------------------------------------
>                 Key: ZOOKEEPER-917
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-917
>             Project: Zookeeper
>          Issue Type: Bug
>          Components: leaderElection, server
>    Affects Versions: 3.2.2
>         Environment: Cloudera distribution of zookeeper (patched to never cache DNS entries)
> Debian lenny
>            Reporter: Alexandre Hardy
>            Priority: Critical
>             Fix For: 3.3.3, 3.4.0
>         Attachments: zklogs-20101102144159SAST.tar.gz
> We had three nodes running zookeeper:
>   *
>   *
>   *
> failed, and was replaced by a new node (automated startup).
> The new node had not participated in any zookeeper quorum previously. The node
> was permanently removed from service and could not contribute to the quorum any further (powered
> DNS entries were updated for the new node to allow all the zookeeper servers to find
> the new node.
> The new node was selected as the LEADER, despite the fact that it had
> not seen the latest zxid.
> This particular problem has not been verified with later versions of zookeeper, and no
> attempt has been made to reproduce this problem as yet.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
