curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Blum <dragonsi...@gmail.com>
Subject Re: PLEASE REVIEW - Major re-work of Watcher wrappers
Date Mon, 08 Feb 2016 03:36:54 GMT
Oooh... let me try that.

On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <jordan@jordanzimmerman.com
> wrote:

> FYI - I added a new method to Timing to help with this:
>
> forSessionSleep()
>
> The new version of KillSession inserts an eventOfDeath directly into the
> client. So, no, it’s not a real session kill like it used to be. But, now
> it’s more reliable. Use timing.forSessionSleep() to wait for session
> timeout.
>
> -JZ
>
>
> On Feb 7, 2016, at 8:21 PM, Scott Blum <dragonsinth@gmail.com> wrote:
>
> Can you describe the change then? Because kill session doesn't seem to now
> ensure that ephemeral nodes bound the the killed session disappear in a
> timely manner
> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com>
> wrote:
>
>> Are we using a new zookeeper?
>>
>>
>> In Curator 3.0 with “new” connection state handling which simulates a
>> Session timeout. However, all tests are run twice. First with the old
>> handling and then with the new.
>>
>> Or did something change with our implementation of KillSession.kill()?
>>
>>
>> KillSession did change though. A change I made to ZK got added in 3.5 and
>> we now use that.
>>
>> -Jordan
>>
>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dragonsinth@gmail.com> wrote:
>>
>> Are we using a new zookeeper?  Or did something change with our
>> implementation of KillSession.kill()?
>>
>> Or maybe there's a timing issue with Curator's ConnectionStateManager
>> State change: LOST?  I don't understand how we could get a LOST event
>> without the ephemeral node attached to that session having disappeared?
>>
>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> I don’t know if anything changed in ZooKeeper itself. I know that the
>>> connection states changed in Curator, but Curator now tests both the old
>>> mode and the new mode and they both fail here.
>>>
>>> -JZ
>>>
>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dragonsinth@gmail.com> wrote:
>>>
>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0
>>> branch is that the ephemeral node /test/me created in
>>> testKilledSession() really isn't disappearing when it should.
>>>
>>> After the session loss and the reconnect, /test still shows 2 children
>>> [foo, me] and /test/me still returns a node.
>>>
>>> Any idea why the timing here would have changed?
>>>
>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dragonsinth@gmail.com>
>>> wrote:
>>>
>>>> https://issues.apache.org/jira/browse/CURATOR-302
>>>>
>>>> I need to trace through what's really going on under the hood rather
>>>> than band-aid the test.  Should be able to in next couple of days.
>>>>
>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>>
>>>>> Any update on this? I think you should create a Jira for it.
>>>>>
>>>>> -Jordan
>>>>>
>>>>>
>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dragonsinth@gmail.com>
wrote:
>>>>>
>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs. master
>>>>> issue.  I think I'm going to just have to dump in a ton of log messages
and
>>>>> see what differs.
>>>>>
>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>
>>>>>> OK - please create a new Issue in Jira for this.
>>>>>>
>>>>>> -Jordan
>>>>>>
>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears to have
>>>>>> been broken for a while.  Maybe I'll have to git bisect...
>>>>>>
>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Okay, so I looked into this for a bit, and I hit kind of a wall.
 I
>>>>>>> think there is a legit bug/race in TreeCache, and the following
patch
>>>>>>> *should* remedy:
>>>>>>>
>>>>>>> diff --git
>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> index df4403c..a4a022b 100644
>>>>>>> ---
>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> +++
>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements Closeable
>>>>>>>          void wasDeleted() throws Exception
>>>>>>>          {
>>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>>> -
>>>>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>>              ConcurrentMap<String, TreeNode> childMap =
>>>>>>> children.getAndSet(null);
>>>>>>>              if ( childMap != null )
>>>>>>>              {
>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements Closeable
>>>>>>>          case RECONNECTED:
>>>>>>>              try
>>>>>>>              {
>>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>>                  root.wasReconnected();
>>>>>>>
>>>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>> +                if ( outstandingOps.decrementAndGet() == 0 )
>>>>>>> +                {
>>>>>>> +                    if ( isInitialized.compareAndSet(false,
true) )
>>>>>>> +                    {
>>>>>>> +
>>>>>>>  publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>> +                    }
>>>>>>> +                }
>>>>>>>              }
>>>>>>>              catch ( Exception e )
>>>>>>>              {
>>>>>>>
>>>>>>> That should guarantee that the initialized event gets deferred
until
>>>>>>> all outstanding refreshes finish.. but it's not.  Something seems
to have
>>>>>>> changed under the hood in how background events are getting sent
to
>>>>>>> TreeCache, and I don't really understand it yet.  And running
the debugger
>>>>>>> seems to affect the timing, like something racy is going on.
:(
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman <
>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>
>>>>>>>>> No, sorry. The last few lines of the test currently are:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
"data".getBytes());
>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>
>>>>>>>>> This fails. But, if I switch them it works:
>>>>>>>>>
>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>
>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED, "/test/me",
"data".getBytes());
>>>>>>>>>
>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> So you end up with 2 initialized events?
>>>>>>>>>
>>>>>>>>> You mean this?
>>>>>>>>>
>>>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
"/test/me",
>>>>>>>>> "data".getBytes());
>>>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>
>>>>>>>>> Seems weird if there are two, but I can help look.
>>>>>>>>>
>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman <
>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Scott,
>>>>>>>>>>
>>>>>>>>>> In this branch, TestTreeCache.testKilledSession()
is failing at:
>>>>>>>>>>
>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
"/test/me",
>>>>>>>>>> "data".getBytes());
>>>>>>>>>>
>>>>>>>>>> However, if I change the two asserts to:
>>>>>>>>>>
>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
"/test/me",
>>>>>>>>>> "data".getBytes());
>>>>>>>>>>
>>>>>>>>>> it works. Does that make any sense?
>>>>>>>>>>
>>>>>>>>>> -Jordan
>>>>>>>>>>
>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman
<
>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Devs,
>>>>>>>>>> >
>>>>>>>>>> > In trying to fix the bad log message "Failed
to find watcher”
>>>>>>>>>> (which turns out to be a ZK client issue), I realize
that the
>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff could be
improved. I’m still
>>>>>>>>>> working on getting all tests to pass but I’d appreciate
more sets of eyes
>>>>>>>>>> on this change. Please review carefully if you can.
>>>>>>>>>> >
>>>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>>>> >
>>>>>>>>>> > -Jordan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message