curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Scott Blum <dragonsi...@gmail.com>
Subject Re: PLEASE REVIEW - Major re-work of Watcher wrappers
Date Mon, 08 Feb 2016 17:18:48 GMT
That totally fixed it!  Thanks, Jordan!

https://github.com/apache/curator/pull/132

On Mon, Feb 8, 2016 at 12:08 PM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> Well, you could add back an alternate version KillSession that works as it
> used it. I’m OK with that.
>
> -JZ
>
>
> On Feb 8, 2016, at 12:06 PM, Scott Blum <dragonsinth@gmail.com> wrote:
>
> Okay, I took a look, and I couldn't see how to use
> timing.forSessionSleep() in this test correctly.  It's not the test harness
> I'm trying to block.  What I'm trying to test involves seeing how TreeCache
> reacts to a *real* session loss.  The fact that KillSession isn't a real
> session loss anymore is therefore impeding my ability to test what I was
> trying to test before.
>
> Let me explain.  The following manual test does what I want:
>
> 1) Connect a TreeCache to a remote ZK server through an ssh tunnel.
> 2) Follow the node creation steps in testKilledSession()
> 3) Kill the tunnel, wait for session loss, restore the tunnel.
>
> So I guess my question is, is there any way to "really" kill the session,
> server side?  Or should I formulate this test completely differently?
>
> Another way to look at this is that I can't synchronize my test steps to
> TreeCache's internal operations effectively.
>
>
> On Sun, Feb 7, 2016 at 10:36 PM, Scott Blum <dragonsinth@gmail.com> wrote:
>
>> Oooh... let me try that.
>>
>> On Sun, Feb 7, 2016 at 8:29 PM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> FYI - I added a new method to Timing to help with this:
>>>
>>> forSessionSleep()
>>>
>>> The new version of KillSession inserts an eventOfDeath directly into the
>>> client. So, no, it’s not a real session kill like it used to be. But, now
>>> it’s more reliable. Use timing.forSessionSleep() to wait for session
>>> timeout.
>>>
>>> -JZ
>>>
>>>
>>> On Feb 7, 2016, at 8:21 PM, Scott Blum <dragonsinth@gmail.com> wrote:
>>>
>>> Can you describe the change then? Because kill session doesn't seem to
>>> now ensure that ephemeral nodes bound the the killed session disappear in a
>>> timely manner
>>> On Feb 7, 2016 8:03 PM, "Jordan Zimmerman" <jordan@jordanzimmerman.com>
>>> wrote:
>>>
>>>> Are we using a new zookeeper?
>>>>
>>>>
>>>> In Curator 3.0 with “new” connection state handling which simulates a
>>>> Session timeout. However, all tests are run twice. First with the old
>>>> handling and then with the new.
>>>>
>>>> Or did something change with our implementation of KillSession.kill()?
>>>>
>>>>
>>>> KillSession did change though. A change I made to ZK got added in 3.5
>>>> and we now use that.
>>>>
>>>> -Jordan
>>>>
>>>> On Feb 7, 2016, at 1:55 PM, Scott Blum <dragonsinth@gmail.com> wrote:
>>>>
>>>> Are we using a new zookeeper?  Or did something change with our
>>>> implementation of KillSession.kill()?
>>>>
>>>> Or maybe there's a timing issue with Curator's ConnectionStateManager
>>>> State change: LOST?  I don't understand how we could get a LOST event
>>>> without the ephemeral node attached to that session having disappeared?
>>>>
>>>> On Sun, Feb 7, 2016 at 10:59 AM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com> wrote:
>>>>
>>>>> I don’t know if anything changed in ZooKeeper itself. I know that the
>>>>> connection states changed in Curator, but Curator now tests both the
old
>>>>> mode and the new mode and they both fail here.
>>>>>
>>>>> -JZ
>>>>>
>>>>> On Feb 7, 2016, at 1:06 AM, Scott Blum <dragonsinth@gmail.com>
wrote:
>>>>>
>>>>> I need to analyze this a bit deeper, but what I'm seeing on the 3.0
>>>>> branch is that the ephemeral node /test/me created in
>>>>> testKilledSession() really isn't disappearing when it should.
>>>>>
>>>>> After the session loss and the reconnect, /test still shows 2 children
>>>>> [foo, me] and /test/me still returns a node.
>>>>>
>>>>> Any idea why the timing here would have changed?
>>>>>
>>>>> On Sat, Feb 6, 2016 at 1:41 PM, Scott Blum <dragonsinth@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> https://issues.apache.org/jira/browse/CURATOR-302
>>>>>>
>>>>>> I need to trace through what's really going on under the hood rather
>>>>>> than band-aid the test.  Should be able to in next couple of days.
>>>>>>
>>>>>> On Sat, Feb 6, 2016 at 7:43 AM, Jordan Zimmerman <
>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>
>>>>>>> Any update on this? I think you should create a Jira for it.
>>>>>>>
>>>>>>> -Jordan
>>>>>>>
>>>>>>>
>>>>>>> On Feb 5, 2016, at 12:30 PM, Scott Blum <dragonsinth@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> BTW, this test passes on master... so it's some kind of 3.0 vs.
>>>>>>> master issue.  I think I'm going to just have to dump in a ton
of log
>>>>>>> messages and see what differs.
>>>>>>>
>>>>>>> On Fri, Feb 5, 2016 at 12:25 PM, Jordan Zimmerman <
>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>
>>>>>>>> OK - please create a new Issue in Jira for this.
>>>>>>>>
>>>>>>>> -Jordan
>>>>>>>>
>>>>>>>> On Feb 5, 2016, at 12:24 PM, Scott Blum <dragonsinth@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> BTW: this is broken on CURATOR-3.0 as well, so it appears
to have
>>>>>>>> been broken for a while.  Maybe I'll have to git bisect...
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2016 at 12:22 PM, Scott Blum <dragonsinth@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Okay, so I looked into this for a bit, and I hit kind
of a wall.
>>>>>>>>> I think there is a legit bug/race in TreeCache, and the
following patch
>>>>>>>>> *should* remedy:
>>>>>>>>>
>>>>>>>>> diff --git
>>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> index df4403c..a4a022b 100644
>>>>>>>>> ---
>>>>>>>>> a/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> +++
>>>>>>>>> b/curator-recipes/src/main/java/org/apache/curator/framework/recipes/cache/TreeCache.java
>>>>>>>>> @@ -303,7 +303,6 @@ public class TreeCache implements
Closeable
>>>>>>>>>          void wasDeleted() throws Exception
>>>>>>>>>          {
>>>>>>>>>              ChildData oldChildData = childData.getAndSet(null);
>>>>>>>>> -
>>>>>>>>>  client.watches().remove(this).ofType(WatcherType.Any).locally().inBackground().forPath(path);
>>>>>>>>>              ConcurrentMap<String, TreeNode> childMap
=
>>>>>>>>> children.getAndSet(null);
>>>>>>>>>              if ( childMap != null )
>>>>>>>>>              {
>>>>>>>>> @@ -807,8 +806,16 @@ public class TreeCache implements
Closeable
>>>>>>>>>          case RECONNECTED:
>>>>>>>>>              try
>>>>>>>>>              {
>>>>>>>>> +                outstandingOps.incrementAndGet();
>>>>>>>>>                  root.wasReconnected();
>>>>>>>>>
>>>>>>>>>  publishEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED);
>>>>>>>>> +                if ( outstandingOps.decrementAndGet()
== 0 )
>>>>>>>>> +                {
>>>>>>>>> +                    if ( isInitialized.compareAndSet(false,
true)
>>>>>>>>> )
>>>>>>>>> +                    {
>>>>>>>>> +
>>>>>>>>>  publishEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>> +                    }
>>>>>>>>> +                }
>>>>>>>>>              }
>>>>>>>>>              catch ( Exception e )
>>>>>>>>>              {
>>>>>>>>>
>>>>>>>>> That should guarantee that the initialized event gets
deferred
>>>>>>>>> until all outstanding refreshes finish.. but it's not.
 Something seems to
>>>>>>>>> have changed under the hood in how background events
are getting sent to
>>>>>>>>> TreeCache, and I don't really understand it yet.  And
running the debugger
>>>>>>>>> seems to affect the timing, like something racy is going
on. :(
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2016 at 11:57 AM, Scott Blum <dragonsinth@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Ok, that is kind of weird.  I'll take a look.
>>>>>>>>>>
>>>>>>>>>> On Fri, Feb 5, 2016 at 4:58 AM, Jordan Zimmerman
<
>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> No, sorry. The last few lines of the test currently
are:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
"/test/me", "data".getBytes());
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>
>>>>>>>>>>> This fails. But, if I switch them it works:
>>>>>>>>>>>
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>
>>>>>>>>>>> assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
"/test/me", "data".getBytes());
>>>>>>>>>>>
>>>>>>>>>>> On Feb 5, 2016, at 2:57 AM, Scott Blum <dragonsinth@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> So you end up with 2 initialized events?
>>>>>>>>>>>
>>>>>>>>>>> You mean this?
>>>>>>>>>>>
>>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.CONNECTION_RECONNECTED)
>>>>>>>>>>> ;
>>>>>>>>>>> +        assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>          assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>
>>>>>>>>>>> Seems weird if there are two, but I can help
look.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 4, 2016 at 10:48 PM, Jordan Zimmerman
<
>>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey Scott,
>>>>>>>>>>>>
>>>>>>>>>>>> In this branch, TestTreeCache.testKilledSession()
is failing at:
>>>>>>>>>>>>
>>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>>
>>>>>>>>>>>> However, if I change the two asserts to:
>>>>>>>>>>>>
>>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.INITIALIZED);
>>>>>>>>>>>>         assertEvent(TreeCacheEvent.Type.NODE_REMOVED,
>>>>>>>>>>>> "/test/me", "data".getBytes());
>>>>>>>>>>>>
>>>>>>>>>>>> it works. Does that make any sense?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jordan
>>>>>>>>>>>>
>>>>>>>>>>>> > On Feb 4, 2016, at 9:23 PM, Jordan Zimmerman
<
>>>>>>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Devs,
>>>>>>>>>>>> >
>>>>>>>>>>>> > In trying to fix the bad log message
"Failed to find watcher”
>>>>>>>>>>>> (which turns out to be a ZK client issue),
I realize that the
>>>>>>>>>>>> NamespaceWatcher and WatcherWrapper stuff
could be improved. I’m still
>>>>>>>>>>>> working on getting all tests to pass but
I’d appreciate more sets of eyes
>>>>>>>>>>>> on this change. Please review carefully if
you can.
>>>>>>>>>>>> >
>>>>>>>>>>>> > https://github.com/apache/curator/pull/131
>>>>>>>>>>>> >
>>>>>>>>>>>> > -Jordan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message