curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cameron McKenzie <mckenzie....@gmail.com>
Subject Re: CURATOR-3.0 tests
Date Wed, 08 Jun 2016 22:33:23 GMT
Does anyone have any thoughts on this?

Either,
-I'm misinterpreting something, and we just need to increase the wait time
/ decrease the session timeout on the CURATOR-335 test.
or
-The way that the new connection state manager works is incorrect and needs
to be fixed.

I'm leaning to the second one. My expectation would be that a LOST event
would be generated after session timeout MS had passed with no connection
to ZK. Not 4/3 of the session timeout MS.



On Tue, Jun 7, 2016 at 5:13 PM, Cameron McKenzie <mckenzie.cam@gmail.com>
wrote:

> I think that the problem is that the processEvents() loop in the
> ConnectionStateManager class checks for timeouts while in a suspended state
> only every 2/3 of the negotiated session length.
>
> I think that the test in CURATOR-335 is perhaps a bit strange in that it
> uses the same session timeout as the amount of time it will wait for the
> session to timeout to occur (+ a sleepForABitCall()), so there's not much
> room for error. That's why it has exposed this issue.
>
> The session timeout is set to 50 seconds.
>
> The initial SUSPEND event occurs.
> Then 50 * (2/3) seconds later the processEvents() loop wakes up again
> after no events have occurred, so checks timeouts. It still hasn't timed
> out, so it polls again for another 50 * (2/3) seconds. There are no
> additional events in this time, which means that the next timeout check
> doesn't occur until 66 seconds (100 * (2/3)) instead of after 50 seconds as
> you would expect.
>
> If I set the assertions to wait for up to 70 seconds for the appropriate
> state to be achieved, then the test passes fine.
>
> The poll call in the processEvents loop must take into account how much
> time has already been spent in a suspended state.
>
> Jordan, do you remember what the rationale behind only waiting for 2/3 of
> the session timeout is? It's something to do with the way that ZK itself
> handles session timeouts isn't it? Does ZK timeout the session if it hasn't
> received a heartbeat for 2/3 of the session timeout? I can't remember.
> cheers
>
>
>
>
> On Tue, Jun 7, 2016 at 10:41 AM, Cameron McKenzie <mckenzie.cam@gmail.com>
> wrote:
>
>> Seems like I have uncovered another problem on the 3.0 branch.
>>
>> It looks like the new (ish) connection handling stuff doesn't seem to be
>> working correctly for long session timeouts. Specifically, the test for
>> CURATOR-335 fails on the 3.0 branch when run with the new connection
>> handling, but works with the 'classic' connection handling.
>>
>> It fails when asserting that the LOST event occurs after the server is
>> stopped.
>>
>> I'm not going to have time to do much more digging for at least today,
>> but I have made a more targeted test case:
>>
>> TestFramework:testSessionLossWithLongTimeout on
>> the long_session_timeout_issue branch.
>>
>> if anyone has time to look before I do.
>>
>> I think that this needs to be resolved before 3.0 can be released.
>> cheers
>>
>>
>> On Mon, Jun 6, 2016 at 9:49 AM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com> wrote:
>>
>>> :D
>>>
>>> > Is it worth holding up the build to merge CURATOR-331?
>>> No, let’s go with what we have.
>>>
>>> > On Jun 5, 2016, at 6:48 PM, Cameron McKenzie <mckenzie.cam@gmail.com>
>>> wrote:
>>> >
>>> > Ah, must still be recovering, I'm sure I saw it was being applied to
>>> the
>>> > 3.0 branch.
>>> >
>>> > I will merge it into master and 3.0.
>>> >
>>> > Is it worth holding up the build to merge CURATOR-331? I have asked
>>> Scott
>>> > what his opinion is since its the TreeCache stuff. It looks ok to me
>>> though.
>>> > cheers
>>> >
>>> > On Mon, Jun 6, 2016 at 9:44 AM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com
>>> >> wrote:
>>> >
>>> >> Yes, that’s correct. It’s a patch against master. I’ll do the
merge if
>>> >> you’re OK with it.
>>> >>
>>> >> -Jordan
>>> >>
>>> >>> On Jun 5, 2016, at 6:42 PM, Cameron McKenzie <mckenzie.cam@gmail.com
>>> >
>>> >> wrote:
>>> >>>
>>> >>> hey Jordan,
>>> >>> The fix for CURATOR-335 looks good to me, but I'm wondering if it
>>> should
>>> >>> actually be applied against master and then merged into 3.0?
>>> >>>
>>> >>> On Fri, Jun 3, 2016 at 12:21 PM, Jordan Zimmerman <
>>> >>> jordan@jordanzimmerman.com> wrote:
>>> >>>
>>> >>>> no worries - get well.
>>> >>>>
>>> >>>>> On Jun 2, 2016, at 9:20 PM, Cameron McKenzie <
>>> mckenzie.cam@gmail.com>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Thanks for sorting this out Jordan. I'm pretty sick today
so won't
>>> get
>>> >>>>> around to looking at it, but I will try over the weekend
or really
>>> next
>>> >>>> week
>>> >>>>> On 3 Jun 2016 7:05 AM, "Jordan Zimmerman" <
>>> jordan@jordanzimmerman.com>
>>> >>>>> wrote:
>>> >>>>>
>>> >>>>>>> It sounds like curator is using a different algorithm
since it
>>> has
>>> >>>>>>> nodes sorting their position to determine if they
have a lease or
>>> >> not.
>>> >>>>>>
>>> >>>>>> No - I just added that as I thought there was a bug.
But, now I
>>> >> realize
>>> >>>>>> I’m wrong. So, it was correct all along. Thanks Ben.
>>> >>>>>>
>>> >>>>>> -Jordan
>>> >>>>
>>> >>>>
>>> >>
>>> >>
>>>
>>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message