curator-user mailing list archives

From Robert Hodges <robert.hod...@continuent.com>
Subject Re: Switching from State suspended, to lost, to suspended
Date Thu, 14 Nov 2013 19:07:19 GMT
Hi, 

I have been looking at the same problem as Henrik.  Just to be clear, the problem is the following:
a process wants to make state updates that are only safe while it holds the leader role.
 

If this is correctly stated, there are three cases that are interesting.  

a.) Ensuring within the process that you have the leadership role when you start.
b.) Ensuring that the process does not give up leadership while such updates are proceeding.
c.) Handling the case where the process loses leadership during the operation, leading to
a late update.

I was planning on handling cases a and b using a shared lock within each process that can
become leader.  To perform updates, threads need to acquire the shared lock.  This is only
granted if the process has leadership to begin with.  To give up leadership you need to acquire
the lock exclusively, which means the leader callback must wait for the shared locks to be
released before returning to Curator.
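
To make the lock idea concrete, here is a minimal sketch of what I have in mind.  The class
and method names (LeadershipGuard, acquireLeadership, releaseLeadership, runIfLeader) are
hypothetical and not part of Curator.  It assumes a java.util.concurrent ReentrantReadWriteLock
stands in for the shared/exclusive lock, and it only addresses cases a and b, not c.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical helper, not part of Curator: guards updates with a shared/exclusive lock. */
final class LeadershipGuard {
    // Fair mode so a waiting writer (giving up leadership) is not starved by new readers.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private boolean leader = false; // written under the write lock, read under the read lock

    /** Call when this process gains leadership. */
    void acquireLeadership() {
        lock.writeLock().lock();
        try {
            leader = true;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /**
     * Call from the leader callback before it returns.
     * Blocks until every in-flight update has released the shared lock.
     */
    void releaseLeadership() {
        lock.writeLock().lock();
        try {
            leader = false;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Runs the update under the shared lock; returns false (without running it) if not leader. */
    boolean runIfLeader(Runnable update) {
        lock.readLock().lock();
        try {
            if (!leader) {
                return false;
            }
            update.run();
            return true;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Worker threads would wrap each update in runIfLeader(), and the leader callback would call
releaseLeadership() just before returning, so it cannot return while an update is in flight.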

Case c is the hard one.  One option is to put a callback on the lock so that clients holding
it will receive an interrupt.  However, there's still a race condition hiding under there
as Arie points out, so this is only a partial solution--in fact it's really identical to checking
the flags as described below.  
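
For the flag check itself, one way to avoid depending on the thread's interrupt status (which
other code can clear) is a separate sticky flag set by the connection state listener.  This is
only a sketch under that assumption; LeadershipFlag, revoke and mayCommit are hypothetical names,
not Curator API, and the race between the check and the commit is still there.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sticky flag, not part of Curator. */
final class LeadershipFlag {
    private final AtomicBoolean revoked = new AtomicBoolean(false);

    /** Call from the connection state listener on LOST or SUSPENDED. */
    void revoke() {
        revoked.set(true);
    }

    /**
     * Check just before committing. Unlike the interrupt status, this stays set
     * even if some library call clears the interrupt flag in between.
     */
    boolean mayCommit() {
        return !revoked.get();
    }
}
```

A fresh LeadershipFlag would be created each time leadership is taken, so an old revocation
does not leak into the next term.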

This could be largely cured if Curator had semantics such that it would not try to select
a new leader before ensuring that the old leader had actually processed the interrupt and
properly exited.  

What are the Curator leader selection semantics in this case?  If Curator does not do something
like what I described, it's almost trivially easy to get overlapping leaders.

Cheers, Robert Hodges

P.S. If there's interest in the lock approach I would be happy to prepare a patch so it can
be added to Curator.

On Nov 14, 2013, at 8:11 AM PST, Arie Zilberstein wrote:

> Henrik,
> 
> You should be able to transactionally test for leadership and update a state variable
> in ZooKeeper.
> This is something that I requested a few weeks ago in a thread named "Atomically setting
> a node's data while having leadership", and I hope it will be implemented. Personally I think
> it is a must-have capability.
> 
> In your scenario, however, since you must update a database, there is a race condition
> that cannot be readily resolved (without some kind of distributed transactions). You can test
> for leadership and then update the DB, but there is no guarantee that the leadership is still
> yours by the end of your DB update call.
> 
> Thanks,
> Arie 
> 
> 
> On Wed, Nov 13, 2013 at 4:02 PM, Henrik Nordvik <henrikno@gmail.com> wrote:
> I've upgraded to Curator 2.3.0.
> LeaderSelector still uses thread interruption to signal the thread running takeLeadership()
> to stop, right?
> Inside my takeLeadership() I do some database operations, and before committing I check
> whether I was interrupted, and roll back if I was.
> However, some code in between clears the interrupt flag (e.g. logback does this), so
> I'm committing even though the connection was lost or suspended.
> 
> I need some other criterion to decide whether I can commit. hasLeadership only checks
> a local flag, which is always true inside takeLeadership().
> Is there another flag I can check?
> 
> 
> --
> Henrik Nordvik
> 
> 
> On Tue, Nov 5, 2013 at 5:21 PM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
> This sounds like a variation of https://issues.apache.org/jira/browse/CURATOR-54. The
> next release of Curator (later this week) provides a more robust way of canceling leadership
> that doesn’t require thread interruption.
> 
> -Jordan
> 
> On Nov 5, 2013, at 1:47 AM, Henrik Nordvik <henrikno@gmail.com> wrote:
> 
>> Hi,
>> 
>> I'm getting some strange behaviour when stopping ZooKeeper in one environment that
>> I can't reproduce locally.
>> The result is that the leader selector "quits" even though it is set to auto-requeue.
>> (I think that happens because the retry loop inside LeaderSelector checks the interrupt flag,
>> which is set again even after I have cleared it.)
>> 
>> I think it boils down to getting
>> 
>> 2013-11-04 18:22:32,501 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager - State change: LOST
>> 2013-11-04 18:22:32,501 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 2013-11-04 18:22:32,503 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager - State change: SUSPENDED
>> 2013-11-04 18:22:32,504 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 
>> ... then I handle the interrupt in the leader thread.
>> 
>> Then I get this:
>> 2013-11-04 18:22:36,465 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager - State change: LOST
>> 2013-11-04 18:22:36,465 INFO  [main-EventThread    ] c.n.c.f.state.ConnectionStateManager - State change: SUSPENDED
>> 2013-11-04 18:22:36,465 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener - StateChanged: LOST
>> 2013-11-04 18:22:36,465 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 2013-11-04 18:22:36,466 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener - StateChanged: SUSPENDED
>> 2013-11-04 18:22:36,466 DEBUG [ectionStateManager-0] s.f.s.a.feed.MyListener - Interrupting thread Thread[LeaderSelector-0,5,main]
>> 
>> 
>> Full log is here: https://gist.github.com/zerd/7316258
>> 
>> The code follows the old leader selector example pretty closely:
>> 
>>     @Override
>>     public void takeLeadership(CuratorFramework curatorFramework) throws Exception {
>>         ourThread = Thread.currentThread();
>>         logger.debug(format("(%s) Got leadership", ourThread));
>>         try {
>>             waitForAndPerformWork();
>>         } catch (InterruptedException e) {
>>             logger.debug(format("(%s) Interrupted ", ourThread), e);
>>         } finally {
>>             logger.debug(format("(%s) No longer leader", ourThread));
>>         }
>>     }
>> 
>>     @Override
>>     public void stateChanged(CuratorFramework curatorFramework, ConnectionState newState) {
>>         logger.debug("StateChanged: " + newState);
>> 
>>         if ((newState == ConnectionState.LOST) || (newState == ConnectionState.SUSPENDED)) {
>>             if (ourThread != null) {
>>                 logger.debug("Interrupting thread " + ourThread);
>>                 ourThread.interrupt();
>>             } else {
>>                 logger.debug("Thread is null");
>>             }
>>         }
>>     }
>> 
>> Is it supposed to go back and forth from lost to suspended?
>> My goal is to get it to resume trying to get the leadership when ZooKeeper comes
>> back. Do I have to requeue it manually when this happens?
>> Would upgrading to the latest Curator with CancelLeadershipException fix this?
>> 
>> Thank you very much for your time.
>> 
>> --
>> Henrik Nordvik
> 
> 
> 

