curator-user mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: Curator barriers missing watch events
Date Tue, 25 Mar 2014 20:16:32 GMT
One thing to know is that it’s not possible to get every ZK event. I don’t know if that
helps. If it’s not too big, I can do a code review on your code. Of course, let’s not
rule out a Curator bug. I’ll take another look at the Barrier code when I get a chance.

-JZ
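
[Editor's sketch] The point above that a client cannot see every ZK event follows from ZooKeeper watches being one-time triggers: any change made after a watch fires and before it is set again produces no notification. The toy model below illustrates only that behaviour. It does not use the ZooKeeper or Curator APIs; Node, readAndWatch, update and simulate are hypothetical names invented for this illustration.

```java
/**
 * Toy in-memory model of a node with a one-shot watch.
 * This is NOT the ZooKeeper API; all names here are hypothetical.
 */
public class OneShotWatchDemo {

    static final class Node {
        private int version;
        private Runnable watcher; // at most one pending watch in this model

        /** Reads the current version and leaves a one-shot watch in a single step. */
        synchronized int readAndWatch(Runnable w) {
            watcher = w;
            return version;
        }

        /** Changes the node; a pending watch fires once and is then cleared. */
        synchronized void update() {
            version++;
            Runnable w = watcher;
            watcher = null; // one-shot: the watch is gone after this change
            if (w != null) {
                w.run();
            }
        }
    }

    /**
     * Sets one watch, applies the given number of updates before the client
     * re-registers, then re-reads the node.
     * Returns {notifications delivered, version seen on re-read}.
     */
    static int[] simulate(int updates) {
        Node node = new Node();
        int[] notifications = {0};
        node.readAndWatch(() -> notifications[0]++);
        for (int i = 0; i < updates; i++) {
            node.update(); // only the first update finds a watch to fire
        }
        int seen = node.readAndWatch(() -> notifications[0]++);
        return new int[] {notifications[0], seen};
    }

    public static void main(String[] args) {
        int[] r = simulate(3);
        System.out.println("notifications=" + r[0] + " versionOnReRead=" + r[1]);
    }
}
```

In this model three updates produce a single notification, yet the re-read still returns the latest version. A wait loop built on watches should therefore decide from the state returned when the watch is set, not from the number of events it has received.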


From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:51:42 PM
To: user@curator.apache.org user@curator.apache.org
Subject:  Re: Curator barriers missing watch events  

I have tried writing a test program which launches two programs in the same manner; each makes
a connection and then loops over barriers with a Thread.sleep(random) in between. This runs
indefinitely and everything works out fine.

I have also tried writing my own barrier, which uses a SharedCount that each participant tries
to increment until it reaches memberQty. This, too, misses watch events and does not work properly.

It’s almost as if something else that I’ve done during the running of my program has broken
ZooKeeper’s watch events somehow. Is there any good way to debug watch events in general? I’ve
tried to look at the DEBUG output for my zookeeper server log, but it looks the same for the
working vs non-working barriers...

_B

On Mar 25, 2014, at 3:42 PM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:

Unfortunately, the barrier recipes aren’t widely used (from what I know). So, there may
well be a bug. If you could get a test to show the problem that would be ideal.

-JZ


From: Brian Phillips brian@etinternational.com
Reply: user@curator.apache.org user@curator.apache.org
Date: March 25, 2014 at 2:38:40 PM
To: user@curator.apache.org user@curator.apache.org
Subject:  Curator barriers missing watch events 

Hi guys, 

I’ve been integrating curator into my project and have recently run into an issue I just
can’t seem to make sense of. 

I’m running two JVMs on the same host machine, each with its own Curator connection. I’m
using the DistributedDoubleBarrier recipe at the beginning of my program and once again
at the end. A bunch of work is done in between, including ZooKeeper sets, gets, and watches
on other nodes. 

I’m finding that everyone always makes it through the first double barrier. At the job-end
barrier, sometimes everyone gets through, but more often than not one of the programs hangs
in the wait() inside enter() and never gets the watch event for the ready path that notifies
it to proceed. If I look in ZooKeeper, I can see that the ready path is actually set. 

It would seem that the watch for one of the programs just never triggers. 

To simplify debugging, I’ve set both double barriers to only ever call enter() and not leave().
Both barriers have their own separate path. 

Also, the program never shuts down or disconnects from zookeeper. It just sleeps infinitely
after it gets out of the final barrier. 

Any idea on how to debug this issue? I don’t mind hacking up zookeeper/curator code to insert
my own debugging statements if it comes to that. 

_Brian

