curator-dev mailing list archives

From "Jordan Zimmerman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-498) LeaderLatch deletes leader and leaves it hung besides a second leader
Date Sat, 29 Dec 2018 22:03:00 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730835#comment-16730835 ]

Jordan Zimmerman commented on CURATOR-498:
------------------------------------------

In your summary of events, how do you know that a given ZNode belongs to a particular session?
I don't see anything in your sample code that examines the ephemeralOwner. Also, in the tx
log I don't see session 0x1000ae5465c0007 shutting down until the end (where it appears that
all ZK connections are closed).
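
To illustrate the ownership check being asked about: an ephemeral ZNode records the id of the session that created it in its ephemeralOwner field, so comparing that field against a session id is what ties a node to a session. The following is only a minimal sketch. The class and method names are hypothetical, and it works on plain long values; in a real client those values would presumably come from `Stat.getEphemeralOwner()` and `ZooKeeper.getSessionId()`.

```java
// Hypothetical helper for illustration; none of these names are Curator or ZooKeeper API.
final class EphemeralOwnerCheck {
    private EphemeralOwnerCheck() {}

    // Parses a session id as it is written in this thread, e.g. "0x1000ae5465c0007".
    static long parseSessionId(String hex) {
        String digits = (hex.startsWith("0x") || hex.startsWith("0X")) ? hex.substring(2) : hex;
        return Long.parseUnsignedLong(digits, 16);
    }

    // A ZNode belongs to a session when its ephemeralOwner equals the session id.
    // An ephemeralOwner of 0 means the ZNode is not ephemeral at all.
    static boolean ownedBy(long ephemeralOwner, long sessionId) {
        return ephemeralOwner != 0L && ephemeralOwner == sessionId;
    }

    // Formats a session id in the same "0x..." style for log comparison.
    static String format(long sessionId) {
        return "0x" + Long.toHexString(sessionId);
    }
}
```

Logging the formatted owner next to each latch node would make it possible to line the nodes up against the sessions in the tx log.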

I don't see how an expired session can appear to create a ZNode. Every time a connection is
repaired, Curator's LeaderLatch calls {{reset()}}, which deletes any known leader node and
creates a new one. Even if there was a create in-flight stuck in a retry loop, the successful
create callback calls {{setNode(event.getName());}}, which will cause the previously set node
to get deleted. Whatever new node is being set is considered valid by ZooKeeper.
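
The swap-and-delete behaviour described above can be sketched with a toy model. This is not Curator's source: the "server" is an in-memory set, the class and the node names are hypothetical, and deletes are applied immediately instead of in the background.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicReference;

// Toy model of the latch's node bookkeeping; an assumption-laden sketch, not Curator code.
final class LatchNodeModel {
    // Stands in for the children of the latch path on the server.
    final Set<String> znodes = new TreeSet<>();
    private final AtomicReference<String> ourPath = new AtomicReference<>();

    // Models a successful create callback: the node exists, then setNode() is called.
    void created(String path) {
        znodes.add(path);
        setNode(path);
    }

    // Swaps in the new path and deletes whatever path was set before it.
    void setNode(String newValue) {
        String oldPath = ourPath.getAndSet(newValue);
        if (oldPath != null) {
            znodes.remove(oldPath);
        }
    }

    // Models reset(): forget and delete the known node; a new create is expected to follow.
    void reset() {
        setNode(null);
    }

    String ourPath() {
        return ourPath.get();
    }
}
```

In this model, whichever create callback runs last wins and the previously set node is removed, so the client ends up tracking exactly one node no matter how many creates complete.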

So, I suspect something else is going on here. I have a wild guess that it has something to
do with retries and session loss but I need to think a lot more about it.

> LeaderLatch deletes leader and leaves it hung besides a second leader
> ---------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (selecting explicitly 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Major
>         Attachments: HaWatcher.log, LeaderLatch0.java, ha.tar.gz, logs.tar.gz
>
>
> The Curator app I am working on uses the LeaderLatch to select a leader out of 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a while and then
> restore it, then after Curator in my app restores its connection to ZK, sometimes not all
> 6 clients are found in the latch path (using zkCli.sh). That is, I have 5 instead of 6.
> After investigating a little, I suspect that LeaderLatch deleted the leader
> in method setNode.
> To investigate it I copied the LeaderLatch code and added some log messages, and from
> them it seems like a very old create() background callback was surprisingly scheduled and
> corrupted the current leader with its stale path name. Meaning, this old one called setNode
> with its stale name, set itself instead of the leader, and deleted the leader. This leaves
> the client running, thinking it is the leader, while another leader is selected.
> If my analysis is correct then it seems like we need to cancel this obsolete create
> callback (I think its session was suspended at 22:38:54 and then lost at 22:39:04, so on
> SUSPENDED cancel ongoing callbacks).
> Please see attached log file and modified LeaderLatch0.
>
> In the log, note that at 22:39:26 it shows that 0000000485 is replaced by 0000000480
> and then probably deleted.
> Note also that at 22:38:52, 34 seconds before, we can see that it was in the reset()
> method ("RESET OUR PATH") and possibly triggered the creation of 0000000480 then.
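
For reference, the reporter's idea of cancelling obsolete create callbacks could be modelled roughly as below. This is a hypothetical, single-threaded sketch and not a proposed patch: the epoch counter, the class, and the "latch-" node names are all assumptions made for illustration, and a real implementation would have to deal with concurrency between the check and the swap.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical model: each reset() starts a new epoch, and a create callback is only
// allowed to install its node if it was issued in the current epoch.
final class EpochGuardedLatchModel {
    // Stands in for the children of the latch path on the server.
    final Set<String> znodes = new TreeSet<>();
    private final AtomicReference<String> ourPath = new AtomicReference<>();
    private final AtomicLong epoch = new AtomicLong();

    // Invalidates all earlier creates, deletes the known node, and returns the
    // ticket that the next create must present in its callback.
    long reset() {
        long ticket = epoch.incrementAndGet();
        String old = ourPath.getAndSet(null);
        if (old != null) {
            znodes.remove(old);
        }
        return ticket;
    }

    // Create callback. Returns true if the node was installed, false if it was stale.
    boolean onCreated(long ticket, String path) {
        znodes.add(path); // the create succeeded on the server either way
        if (ticket != epoch.get()) {
            znodes.remove(path); // stale callback: remove its own node, leave ourPath alone
            return false;
        }
        String old = ourPath.getAndSet(path);
        if (old != null && !old.equals(path)) {
            znodes.remove(old);
        }
        return true;
    }

    String ourPath() {
        return ourPath.get();
    }
}
```

Replaying the reported order in this model (the create for 0000000485 completes, then the older create for 0000000480 calls back) leaves 0000000485 in place and discards 0000000480 instead of the other way around.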



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
