curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: PLEASE REVIEW - Major re-work of Watcher wrappers
Date Mon, 08 Feb 2016 01:29:18 GMT
FYI - I added a new method to Timing to help with this:
forSessionSleep()
The new version of KillSession inserts an eventOfDeath directly into the client. So, no, it’s
not a real session kill like it used to be. But, now it’s more reliable. Use timing.forSessionSleep()
to wait for session timeout.

-JZ


> On Feb 7, 2016, at 8:21 PM, Scott Blum <dragonsinth@gmail.com> wrote:
> 
> Can you describe the change then? Because kill session doesn't seem to now ensure that
ephemeral nodes bound the the killed session disappear in a timely manner
> 
> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com <mailto:jordan@jordanzimmerman.com>>
wrote:
>> Are we using a new zookeeper?
> 
> 
> In Curator 3.0 with “new” connection state handling which simulates a Session timeout.
However, all tests are run twice. First with the old handling and then with the new.
> 
>> Or did something change with our implementation of KillSession.kill()?
> 
> KillSession did change though. A change I made to ZK got added in 3.5 and we now use
that.
> 
> -Jordan
> 
>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dragonsinth@gmail.com <mailto:dragonsinth@gmail.com>>
wrote:
>> 
>> Are we using a new zookeeper?  Or did something change with our implementation of
KillSession.kill()?
>> 
>> Or maybe there's a timing issue with Curator's ConnectionStateManager State change:
LOST?  I don't understand how we could get a LOST event without the ephemeral node attached
to that session having disappeared?
>> 
>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <jordan@jordanzimmerman.com
<mailto:jordan@jordanzimmerman.com>> wrote:
>> I don’t know if anything changed in ZooKeeper itself. I know that the connection
states changed in Curator, but Curator now tests both the old mode and the new mode and they
both fail here.
>> 
>> -JZ
>> 
>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dragonsinth@gmail.com <mailto:dragonsinth@gmail.com>>
wrote:
>>> 
>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0 branch is
that the ephemeral node /test/me created in testKilledSession() really isn't disappearing
when it should.
>>> 
>>> After the session loss and the reconnect, /test still shows 2 children [foo,
me] and /test/me still returns a node.
>>> 
>>> Any idea why the timing here would have changed?
>>> 
>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dragonsinth@gmail.com <mailto:dragonsinth@gmail.com>>
wrote:
>>> https://issues.apache.org/jira/browse/CURATOR-302 <https://issues.apache.org/jira/browse/CURATOR-302>
>>> 
>>> I need to trace through what's really going on under the hood rather than band-aid
the test.  Should be able to in next couple of days.
>>> 
>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <jordan@jordanzimmerman.com
<mailto:jordan@jordanzimmerman.com>> wrote:
>>> Any update on this? I think you should create a Jira for it.
>>> 
>>> -Jordan
>>> 
>>> 
>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dragonsinth@gmail.com <mailto:dragonsinth@gmail.com>>
wrote:
>>>> 
>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master issue.
 I think I'm going to just have to dump in a ton of log messages and see what differs.
>>>> 
>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <jordan@jordanzimmerman.com
<mailto:jordan@jordanzimmerman.com>> wrote:
>>>> OK - please create a new Issue in Jira for this.
>>>> 
>>>> -Jordan
>>>> 
>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com <mailto:dragonsinth@gmail.com>>
wrote:
>>>>> 
>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have been
broken for a while.  Maybe I'll have to git bisect...
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com
<mailto:dragonsinth@gmail.com>> wrote:
>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.  I think
there is a legit bug/race in TreeCache, and the following patch *should* remedy:
>>>>> 
>>>>> diff --git a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> index df4403c..a4a022b 100644
>>>>> --- a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> +++ b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>          void wasDeleted() throws Exception
>>>>>          {
>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>> -            client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>              ConcurrentMap<String, TreeNode> childMap = children.getAndSet(null);
>>>>>              if ( childMap != null )
>>>>>              {
>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>          case RECONNECTED:
>>>>>              try
>>>>>              {
>>>>> +                outstandingOps.incrementAndGet();
>>>>>                  root.wasReconnected();
>>>>>                  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>> +                {
>>>>> +                    if ( isInitialized.compareAndSet(false, true) )
>>>>> +                    {
>>>>> +                        publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> +                    }
>>>>> +                }
>>>>>              }
>>>>>              catch ( Exception e )
>>>>>              {
>>>>> 
>>>>> That should guarantee that the initialized event gets deferred until
all outstanding refreshes finish.. but it's not.  Something seems to have changed under the
hood in how background events are getting sent to TreeCache, and I don't really understand
it yet.  And running the debugger seems to affect the timing, like something racy is going
on. :(
>>>>> 
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com
<mailto:dragonsinth@gmail.com>> wrote:
>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>> 
>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <jordan@jordanzimmerman.com
<mailto:jordan@jordanzimmerman.com>> wrote:
>>>>> No, sorry. The last few lines of the test currently are:
>>>>> 
>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> 
>>>>> This fails. But, if I switch them it works:
>>>>> 
>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me", "data".getBytes());
>>>>> 
>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com
<mailto:dragonsinth@gmail.com>> wrote:
>>>>>> 
>>>>>> So you end up with 2 initialized events?
>>>>>> 
>>>>>> You mean this?
>>>>>> 
>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
"data".getBytes());
>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>> 
>>>>>> Seems weird if there are two, but I can help look.
>>>>>> 
>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <jordan@jordanzimmerman.com
<mailto:jordan@jordanzimmerman.com>> wrote:
>>>>>> Hey Scott,
>>>>>> 
>>>>>> In this branch, TestTreeCache.testKilledSession() is failing at:
>>>>>> 
>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
"data".getBytes());
>>>>>> 
>>>>>> However, if I change the two asserts to:
>>>>>> 
>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
"data".getBytes());
>>>>>> 
>>>>>> it works. Does that make any sense?
>>>>>> 
>>>>>> -Jordan
>>>>>> 
>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman <jordan@jordanzimmerman.com
<mailto:jordan@jordanzimmerman.com>> wrote:
>>>>>> >
>>>>>> > Devs,
>>>>>> >
>>>>>> > In trying to fix the bad log message "Failed to find watcher”
(which turns out to be a ZK client issue), I realize that the NamespaceWatcher and WatcherWrapper
stuff could be improved. I’m still working on getting all tests to pass but I’d appreciate
more sets of eyes on this change. Please review carefully if you can.
>>>>>> >
>>>>>> > https://github.com/apache/curator/pull/131 <https://github.com/apache/curator/pull/131>
>>>>>> >
>>>>>> > -Jordan
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message