curator-user mailing list archives

From "Brian Phillips" <br...@etinternational.com>
Subject Re: Curator barriers missing watch events
Date Tue, 25 Mar 2014 23:10:15 GMT
I’ve tried, but it seems to be timing-specific. It’s in a rather large, complicated program,
where the first barrier always works but the one at the end of the program usually gets stuck.
I’ve spent all day trying to make sense of it, as my project really needs it to work.

I’d like to be able to figure out if the zookeeper server is actually sending my clients
the watch events. 

_B
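
One way to see which watches the server currently holds is ZooKeeper's four-letter admin commands: "wchp" lists watches grouped by path and "wchc" lists them grouped by session. The following is only a minimal sketch, assuming the server has these commands enabled and listens on the default client port 2181. The class name FourLetterWord and the send() helper are hypothetical, not part of ZooKeeper or Curator.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sends a ZooKeeper four-letter command over a plain socket.
final class FourLetterWord {
    private FourLetterWord() {}

    /** Sends one four-letter command (for example "wchp") and returns the raw reply. */
    static String send(String host, int port, String command, int timeoutMs) throws IOException {
        if (command.length() != 4) {
            throw new IllegalArgumentException("expected a four-letter command: " + command);
        }
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs);

            OutputStream out = socket.getOutputStream();
            out.write(command.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            socket.shutdownOutput(); // behave like nc: finish sending before reading

            // The server answers and then closes the connection, so read until EOF.
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream reply = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.write(buf, 0, n);
            }
            return new String(reply.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: a local server on the standard client port.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        System.out.print(send(host, port, "wchp", 5000)); // watches grouped by path
        System.out.print(send(host, port, "wchc", 5000)); // watches grouped by session
    }
}
```

Running this while one program is stuck in enter() would show whether the server still has a watch registered on the ready path for that session. It does not show events in flight, so it only narrows the problem down.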


> On Mar 25, 2014, at 6:53 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com> wrote:
> 
> There’s no way you can distill your usage into a test?
> 
> -JZ
> 
> 
> From: Brian Phillips brian@etinternational.com
> Reply: user@curator.apache.org user@curator.apache.org
> Date: March 25, 2014 at 5:51:37 PM
> To: user@curator.apache.org user@curator.apache.org
> Subject:  Re: Curator barriers missing watch events 
> 
>> Hmm, I made that change, but it didn't seem to help. The first program made it to
>> the barrier enter, then the second program entered, exited, and the first program never
>> left the barrier.
>> 
>> The second program got a node created event, but the first program never got any
>> event from its watcher.
>> 
>> I appreciate the help! Must be something else.
>> 
>> _B
>> 
>> On Mar 25, 2014, at 6:28 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com> wrote:
>> 
>>> Look at line 313 and line 331. The no-arg version of enter() causes internalEnter()
>>> to call wait() even though the watcher may have already notified. I believe line 331
>>> should be:
>>> 
>>> else if ( !hasBeenNotified.get() )
>>> 
>>> -JZ
>>> 
>>> 
>>> From: Brian Phillips brian@etinternational.com
>>> Reply: user@curator.apache.org user@curator.apache.org
>>> Date: March 25, 2014 at 5:25:48 PM
>>> To: user@curator.apache.org user@curator.apache.org
>>> Subject:  Re: Curator barriers missing watch events
>>> 
>>>> I am using the no arg version! What's the bug?
>>>> 
>>>> _B
>>>> 
>>>> On Mar 25, 2014, at 6:23 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com> wrote:
>>>> 
>>>>> Which version of enter() are you using? I see a potential bug when the
>>>>> no-arg version of enter() is used.
>>>>> 
>>>>> 
>>>>> From: Brian Phillips brian@etinternational.com
>>>>> Reply: Brian Phillips brian@etinternational.com
>>>>> Date: March 25, 2014 at 4:19:36 PM
>>>>> To: Jordan Zimmerman jordan@jordanzimmerman.com
>>>>> Subject:  Re: Curator barriers missing watch events
>>>>> 
>>>>>> Good idea, but yes I am. The connection state doesn’t change while
>>>>>> I’m executing the barrier code. It seems to be some kind of race condition, I think,
>>>>>> as sometimes it works and sometimes it doesn’t. I’ve looked through the recipe code and
>>>>>> it looks good as far as I can tell, though. I’m practically pulling my hair out at this point.
>>>>>> 
>>>>>> I may try a non-curator, zookeeper-only barrier tomorrow to see if that
>>>>>> works. Or I may start trying to debug the zookeeper client, to see if it’s actually
>>>>>> getting the watches but not delivering them.
>>>>>> 
>>>>>> _B
>>>>>> 
>>>>>>> On Mar 25, 2014, at 4:54 PM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
>>>>>>> 
>>>>>>> Are you setting a ConnectionStateListener? If the connection
>>>>>>> gets SUSPENDED or LOST then you’d need to reinitialize your barrier.
>>>>>>> 
>>>>>>> -JZ
>>>>>>> 
>>>>>>> 
>>>>>>> From: Brian Phillips brian@etinternational.com
>>>>>>> Reply: user@curator.apache.org user@curator.apache.org
>>>>>>> Date: March 25, 2014 at 2:51:42 PM
>>>>>>> To: user@curator.apache.org user@curator.apache.org
>>>>>>> Subject:  Re: Curator barriers missing watch events 
>>>>>>> 
>>>>>>>> I have tried writing a test program which launches two programs
>>>>>>>> in the same manner; each makes a connection, then loops over barriers with a
>>>>>>>> Thread.sleep(random) in between. This runs indefinitely and everything works out fine.
>>>>>>>> 
>>>>>>>> I have also tried writing my own barrier, which uses a SharedCount,
>>>>>>>> where each guy tries to increment it until it hits a memberQty. This too misses
>>>>>>>> watch events and does not work properly.
>>>>>>>> 
>>>>>>>> It’s almost as if something else that I’ve done during
>>>>>>>> the running of my program has broken zookeeper’s watch events somehow. Is there any
>>>>>>>> good way to debug watch events in general? I’ve tried to look at the DEBUG output in my
>>>>>>>> zookeeper server log, but it looks the same for the working vs. non-working barriers...
>>>>>>>> 
>>>>>>>> _B
>>>>>>>> 
>>>>>>>>> On Mar 25, 2014, at 3:42 PM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
>>>>>>>>> 
>>>>>>>>> Unfortunately, the barrier recipes aren’t widely used
>>>>>>>>> (from what I know). So, there may well be a bug. If you could get a test to show the
>>>>>>>>> problem, that would be ideal.
>>>>>>>>> 
>>>>>>>>> -JZ
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> From: Brian Phillips brian@etinternational.com
>>>>>>>>> Reply: user@curator.apache.org user@curator.apache.org
>>>>>>>>> Date: March 25, 2014 at 2:38:40 PM
>>>>>>>>> To: user@curator.apache.org user@curator.apache.org
>>>>>>>>> Subject:  Curator barriers missing watch events 
>>>>>>>>> 
>>>>>>>>>> Hi guys, 
>>>>>>>>>> 
>>>>>>>>>> I’ve been integrating curator into my project and
>>>>>>>>>> have recently run into an issue I just can’t seem to make sense of.
>>>>>>>>>> 
>>>>>>>>>> I’m running two JVMs on the same host machine,
>>>>>>>>>> each with its own curator connection. At the beginning of my program I’m using the
>>>>>>>>>> DistributedDoubleBarrier recipe, and once again at the end of my program. A bunch of
>>>>>>>>>> work is done in between, including zookeeper set/get/watches of other nodes.
>>>>>>>>>> 
>>>>>>>>>> I’m finding that everyone always makes it through the first double
>>>>>>>>>> barrier. At the job-end barrier, sometimes everyone gets through, but more often
>>>>>>>>>> than not one of the programs hangs in enter's wait() and never gets the watch event for
>>>>>>>>>> the ready path which notifies it to proceed. If I look in zookeeper, I can see that the
>>>>>>>>>> ready path is actually set in there.
>>>>>>>>>> 
>>>>>>>>>> It would seem that the watch for one of the programs
>>>>>>>>>> just never triggers.
>>>>>>>>>> 
>>>>>>>>>> To simplify debugging, I’ve set both double barriers
>>>>>>>>>> to only ever call enter() and not leave(). Both barriers have their own separate path.
>>>>>>>>>> 
>>>>>>>>>> Also, the program never shuts down or disconnects
>>>>>>>>>> from zookeeper. It just sleeps infinitely after it gets out of the final barrier.
>>>>>>>>>> 
>>>>>>>>>> Any idea on how to debug this issue? I don’t mind
>>>>>>>>>> hacking up zookeeper/curator code to insert my own debugging statements if it comes to that.
>>>>>>>>>> 
>>>>>>>>>> _Brian
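
The guard suggested in the thread, checking hasBeenNotified before calling wait(), follows the usual guarded-wait pattern. The following is a minimal sketch of that pattern only, not Curator's actual code: the field name hasBeenNotified is borrowed from the thread, while the class GuardedWaitSketch and the methods onWatchEvent() and awaitReady() are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: shows why wait() must be guarded by a "already notified" flag.
final class GuardedWaitSketch {
    // Name borrowed from the thread; everything else in this class is hypothetical.
    private final AtomicBoolean hasBeenNotified = new AtomicBoolean(false);

    /** Stands in for the watcher callback that fires when the ready node appears. */
    synchronized void onWatchEvent() {
        hasBeenNotified.set(true);
        notifyAll();
    }

    /** Blocks until onWatchEvent() has run; returns at once if it already has. */
    synchronized void awaitReady() throws InterruptedException {
        // Check the flag before every wait(). An unguarded wait() would block
        // forever if the notification had already been delivered. The loop also
        // covers spurious wakeups.
        while (!hasBeenNotified.get()) {
            wait();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        GuardedWaitSketch barrier = new GuardedWaitSketch();
        barrier.onWatchEvent(); // the event arrives before anyone waits
        barrier.awaitReady();   // returns immediately instead of hanging
        System.out.println("passed barrier");
    }
}
```

Because both methods synchronize on the same monitor, the flag check and the wait() cannot be interleaved with a notification, which is the window a lost wakeup needs.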