curator-user mailing list archives

From "Brian Phillips" <br...@etinternational.com>
Subject Re: Curator barriers missing watch events
Date Thu, 27 Mar 2014 17:36:44 GMT
So I guess I'm going to go back to using the double barrier recipe.


Jordan, are you a Curator contributor? Are you going to check in that race condition fix you
found for the next version of Curator?


  _____  

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
To: Brian Phillips [mailto:brian@etinternational.com], user@curator.apache.org
Sent: Thu, 27 Mar 2014 13:30:26 -0500
Subject: Re: Curator barriers missing watch events


https://cwiki.apache.org/confluence/display/CURATOR/TN1


:) 


 

From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 27, 2014 at 12:26:32 PM
To: user@curator.apache.org user@curator.apache.org
Subject:  Re: Curator barriers missing watch events 

 


I finally figured out my problem, and it was my fault. Hopefully someone
else can learn from this.

  
What was happening was that I was using a zookeeper watch  event to kick off a bunch of code,
which then ended up in the  zookeeper/curator barrier. Since the watch thread was the one
that  executed the barrier, it blocked itself from receiving any  additional watch events
from zookeeper, including the ones that the  barrier depended upon.  

  
So as a general rule, DON'T BLOCK YOUR WATCH THREADS. I feel  stupid for not realizing this
sooner.  
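
A minimal sketch of the rule above, using only the JDK so it runs without a ZooKeeper server. It is a simulation under stated assumptions, not Curator or ZooKeeper code: a single-threaded executor stands in for the client's event thread, a CountDownLatch stands in for the watch event the barrier waits on, and every name in it is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simulation only: no ZooKeeper or Curator classes are used. A single-threaded
// executor stands in for the client's event thread, and a CountDownLatch stands
// in for the watch event that releases the barrier. All names are hypothetical.
class WatchThreadDemo {

    // Returns true if the simulated barrier was released.
    static boolean run(boolean handOff) {
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        CountDownLatch readyEvent = new CountDownLatch(1);
        CountDownLatch released = new CountDownLatch(1);

        // Work kicked off by the first watch event; it ends up waiting in the barrier.
        // A real barrier would wait indefinitely; the timeout lets the demo finish.
        Runnable enterBarrier = () -> {
            try {
                if (readyEvent.await(300, TimeUnit.MILLISECONDS)) {
                    released.countDown();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        try {
            // First event: run the work on the event thread, or hand it to a worker.
            Runnable firstCallback = handOff ? () -> worker.execute(enterBarrier) : enterBarrier;
            eventThread.execute(firstCallback);
            // Second event: the notification the barrier needs. It is queued behind
            // the first callback because both are delivered by the same thread.
            eventThread.execute(readyEvent::countDown);
            return released.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            eventThread.shutdownNow();
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("blocking in the callback: released=" + run(false));
        System.out.println("handing off to a worker:  released=" + run(true));
    }
}
```

When the callback waits on the event thread, the notification it needs is stuck in the queue behind it, so the barrier is never released. When the callback only submits the work to another thread and returns, the event thread stays free to deliver the notification.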

  
_Brian=
      _____  

  From: Brian Phillips  [mailto:brian@etinternational.com]
  To: user@curator.apache.org
  Sent: Wed, 26 Mar 2014 15:32:13 -0500
  Subject: Re: Curator barriers missing watch events
  
  So I'm still  working on this issue. I grabbed a zookeeper only barrier  implementation
from here:  
  
  
http://zookeeper.apache.org/doc/r3.3.3/zookeeperTutorial.html  

  
This barrier makes its own ZooKeeper connection, separate from the Curator connection
that my program uses. When I put this barrier into my program, everything works as it should,
and nobody gets stuck on the barriers. I then modified the barrier to use Curator's connection,
passing in CuratorFramework.getZookeeperClient().getZooKeeper() instead of connecting separately.
Once I did this, it broke exactly as it did before when using the Curator barrier.

  
This seems to indicate to me that something else I've done in  the program has 'broken' the
zookeeper session associated with my  curator connection, to the point where some watch events
no longer  work.  

  
I'm going to embark on the arduous process of trying to figure out what I'm doing that's breaking
my session's watches. Watches not working properly is disturbing, and will certainly prevent
other parts of my program from functioning correctly, probably in less obvious ways.

  
_Brian=  
      _____  

  From: Brian Phillips [mailto:brian@etinternational.com]
    To: user@curator.apache.org  [mailto:user@curator.apache.org]
  Sent: Tue, 25 Mar 2014 20:39:31 -0500
  Subject: Re: Curator barriers missing watch events
  
  
Yes, there are two barrier sessions, but with different barrier instances and different barrier
paths. ):
  
  Sent from my iPhone  

  On Mar 25, 2014, at 8:34 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com>  wrote:
  
    
  
  Are you saying there are two barrier sessions? The first one works,  but the second doesn’t?
Are you re-using the same path? I wonder if  there are znodes left in the path or something.
Before running the  second barrier session, double check that the path is empty (do a  getChildren
on it). If it’s not empty that could be the  problem.  
  
  
  -JZ  
  
  
    

  From: Brian Phillips brian@etinternational.com
    Reply: user@curator.apache.org  user@curator.apache.org
  Date: March 25, 2014 at 6:10:46  PM
  To: user@curator.apache.org  user@curator.apache.org
  Subject:  Re: Curator barriers  missing watch events
  
    
  
  
  

  I’ve tried, but it seems to be timing specific. It’s in a rather large, complicated program,
where the first barrier always works but the one at the end of the program usually gets stuck.
I’ve spent all day trying to make sense of it, as my project really needs it to work.
 

  I’d like to be able to figure out if the zookeeper server is  actually sending my clients
the watch events.   

  _B    

  On Mar 25, 2014, at 6:53 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com>  wrote:
  
    
  
  There’s no way you can distill your usage into a test?  
  
  
  -JZ  
  
  
    

  From: Brian Phillips brian@etinternational.com
    Reply: user@curator.apache.org  user@curator.apache.org
  Date: March 25, 2014 at 5:51:37  PM
  To: user@curator.apache.org  user@curator.apache.org
  Subject:  Re: Curator barriers  missing watch events
  
    
  
  
Hmm, I made that change, but it didn't seem to help. The  first program made it to the barrier
enter, then the second program  entered, exited, and the first program never left the  barrier.
 

  
The second program got a node created event, but the  first program never got any event from
its watcher.  

  
I appreciate the help! Must be something  else.  

  _B  

  On Mar 25, 2014, at 6:28 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com>  wrote:
  
    
  
  Look at line 313 and line 331. The noarg version of enter()  causes internalEnter() to call
wait even though the watcher  may have already notified. I believe line 331 should  be:  
  
  
  else if ( !hasBeenNotified.get() )  
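
A minimal sketch of the kind of race described above, assuming a plain monitor with a notified flag. It is not the Curator source and does not reproduce lines 313 or 331; the class and method names are hypothetical, and only the hasBeenNotified name is taken from this message. The timeouts exist so the demo terminates.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: this is not the Curator source, and the class and method
// names are hypothetical. Only the hasBeenNotified name comes from the thread.
class NotifiedFlagDemo {
    private final AtomicBoolean hasBeenNotified = new AtomicBoolean(false);

    // Plays the role of the watcher callback.
    synchronized void signal() {
        hasBeenNotified.set(true);
        notifyAll();
    }

    // Unguarded: always calls wait(), even if signal() already ran, so the
    // wake-up is missed. Returns true to report that the caller blocked.
    // timeoutMs must be positive; wait(0) would block forever.
    synchronized boolean enterUnguarded(long timeoutMs) {
        try {
            wait(timeoutMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return true;
    }

    // Guarded: re-checks the flag before each wait().
    // Returns true only if the caller had to block.
    synchronized boolean enterGuarded(long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        boolean blocked = false;
        try {
            while (!hasBeenNotified.get()) {
                long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMs <= 0) {
                    break;
                }
                blocked = true;
                wait(remainingMs);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return blocked;
    }

    public static void main(String[] args) {
        NotifiedFlagDemo demo = new NotifiedFlagDemo();
        demo.signal(); // the notification arrives before anyone waits
        System.out.println("unguarded blocked: " + demo.enterUnguarded(100));
        System.out.println("guarded blocked:   " + demo.enterGuarded(100));
    }
}
```

With no timeout, the unguarded version would sit in wait() forever once the notification has already fired, which matches the hang described in this thread. Checking the flag first lets the caller skip the wait entirely.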
  
  
  -JZ  
  
  
    

  From: Brian Phillips brian@etinternational.com
    Reply: user@curator.apache.org  user@curator.apache.org
  Date: March 25, 2014 at 5:25:48  PM
  To: user@curator.apache.org  user@curator.apache.org
  Subject:  Re: Curator barriers  missing watch events
  
    
  
  
I am using the no arg version! What's the bug?
  
  _B  

  On Mar 25, 2014, at 6:23 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com>  wrote:
  
    
  
  Which version of enter() are you using? I see a potential bug  when the no arg version of
enter() is used.  
  
  
    

  From: Brian Phillips brian@etinternational.com
    Reply: Brian Phillips  brian@etinternational.com
    Date: March 25, 2014 at 4:19:36  PM
  To: Jordan Zimmerman jordan@jordanzimmerman.com
    Subject:  Re: Curator barriers  missing watch events
  
    
  
  
Good idea, but yes I am. The connection state doesn’t change while I’m executing the
barrier code. It seems to be some kind of race condition I think, as sometimes it works and
sometimes it doesn’t. I’ve looked through the recipe code and it looks good as far as
I can tell though. I’m practically pulling my hair out at this point.

  
I may try a non-curator, zookeeper-only barrier tomorrow. See if that works. Or I may start
trying to debug the zookeeper client, see if it's actually getting the watches but not delivering
them.

  
_B  
  
  
On Mar 25, 2014, at 4:54 PM, Jordan Zimmerman  <jordan@jordanzimmerman.com>  wrote:
 
    
  
  Are you setting a ConnectionStateListener? If the connection  gets SUSPENDED or LOST then
you’d need to reinitialize your  barrier.  
  
  
  -JZ  
  
  
    

  From: Brian Phillips brian@etinternational.com
    Reply: user@curator.apache.org user@curator.apache.org
    Date: March 25, 2014 at 2:51:42 PM
  To: user@curator.apache.org user@curator.apache.org
    Subject:  Re: Curator barriers missing  watch events 
  
    
  
  
I have tried writing a test program which launches two programs in the same manner; each makes
a connection, then loops over barriers with a Thread.sleep(random) in between. This runs indefinitely
and everything works out fine.

  
I have also tried writing my own barrier, which uses a SharedCount, where each guy tries
to increment it until it hits a memberQty. This too misses watch events and does not work
properly.

  
It’s almost as if something else that I’ve done during the running of my program has
broken zookeeper’s watch events somehow. Is there any good way to debug watch events in general?
I’ve tried to look at the DEBUG output for my zookeeper server log, but it looks the same
for the working vs non-working barriers...

  
_B  
  
  
On Mar 25, 2014, at 3:42 PM, Jordan Zimmerman  <jordan@jordanzimmerman.com>  wrote:
 
    
  
  Unfortunately, the barrier recipes aren’t widely used (from  what I know). So, there may
well be a bug. If you could get a test  to show the problem that would be ideal.  
  
  
  -JZ  
  
  
    

  From: Brian Phillips brian@etinternational.com
    Reply: user@curator.apache.org user@curator.apache.org
    Date: March 25, 2014 at 2:38:40 PM
  To: user@curator.apache.org user@curator.apache.org
    Subject:  Curator barriers missing watch  events 
  
  Hi guys, 
  
  I’ve been integrating curator into my project and have recently run  into an issue I just
can’t seem to make sense of. 
  
  I’m running two JVMs on the same host machine, each with their own  curator connection.
At the beginning of my program I’m using the  DistributedDoubleBarrier recipe, and once
again at the end of my  program. A bunch of work is done in-between, including zookeeper 
set/get/watches of other nodes. 
  
  I’m finding that everyone always makes it through the first double barrier. At the job-end
barrier, sometimes everyone gets through, but more often than not one of the programs hangs
in enter's wait(), and never gets the watch event for the ready path which notifies it to
proceed. If I look in zookeeper, I can see that the ready path is actually set in there.

  
  It would seem that the watch for one of the programs just never  triggers. 
  
  To simplify debugging, I’ve set both double barriers to only ever  call enter() and not
leave(). Both barriers have their own separate  path. 
  
  Also, the program never shuts down or disconnects from zookeeper.  It just sleeps infinitely
after it gets out of the final  barrier. 
  
  Any idea on how to debug this issue? I don’t mind hacking up  zookeeper/curator code to
insert my own debugging statements if it  comes to that. 
  
  _Brian=                    
                                                      