curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Cameron McKenzie <mckenzie....@gmail.com>
Subject Re: CURATOR-3.0 tests
Date Tue, 07 Jun 2016 07:13:25 GMT
I think that the problem is that the processEvents() loop in the
ConnectionStateManager class checks for timeouts while in a suspended state
only every 2/3 of the negotiated session length.

I think that the test in CURATOR-335 is perhaps a bit strange in that it
uses the same session timeout as the amount of time it will wait for the
session to timeout to occur (+ a sleepForABitCall()), so there's not much
room for error. That's why it has exposed this issue.

The session timeout is set to 50 seconds.

The initial SUSPEND event occurs.
Then 50 * (2/3) seconds later the processEvents() loop wakes up again after
no events have occurred, so checks timeouts. It still hasn't timed out, so
it polls again for another 50 * (2/3) seconds. There are no additional
events in this time, which means that the next timeout check doesn't occur
until 66 seconds (100 * (2/3)) instead of after 50 seconds as you would
expect.

If I set the assertions to wait for up to 70 seconds for the appropriate
state to be achieved, then the test passes fine.

The poll call in the processEvents loop must take into account how much
time has already been spent in a suspended state.

Jordan, do you remember what the rationale behind only waiting for 2/3 of
the session timeout is? It's something to do with the way that ZK itself
handles session timeouts isn't it? Does ZK timeout the session if it hasn't
received a heartbeat for 2/3 of the session timeout? I can't remember.
cheers




On Tue, Jun 7, 2016 at 10:41 AM, Cameron McKenzie <mckenzie.cam@gmail.com>
wrote:

> Seems like I have uncovered another problem on the 3.0 branch.
>
> It looks like the new (ish) connection handling stuff doesn't seem to be
> working correctly for long session timeouts. Specifically, the test for
> CURATOR-335 fails on the 3.0 branch when run with the new connection
> handling, but works with the 'classic' connection handling.
>
> It fails when asserting that the LOST event occurs after the server is
> stopped.
>
> I'm not going to have time to do much more digging for at least today, but
> I have made a more targeted test case:
>
> TestFramework:testSessionLossWithLongTimeout on
> the long_session_timeout_issue branch.
>
> if anyone has time to look before I do.
>
> I think that this needs to be resolved before 3.0 can be released.
> cheers
>
>
> On Mon, Jun 6, 2016 at 9:49 AM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>> :D
>>
>> > Is it worth holding up the build to merge CURATOR-331?
>> No, let’s go with what we have.
>>
>> > On Jun 5, 2016, at 6:48 PM, Cameron McKenzie <mckenzie.cam@gmail.com>
>> wrote:
>> >
>> > Ah, must still be recovering, I'm sure I saw it was being applied to the
>> > 3.0 branch.
>> >
>> > I will merge it into master and 3.0.
>> >
>> > Is it worth holding up the build to merge CURATOR-331? I have asked
>> Scott
>> > what his opinion is since its the TreeCache stuff. It looks ok to me
>> though.
>> > cheers
>> >
>> > On Mon, Jun 6, 2016 at 9:44 AM, Jordan Zimmerman <
>> jordan@jordanzimmerman.com
>> >> wrote:
>> >
>> >> Yes, that’s correct. It’s a patch against master. I’ll do the merge
if
>> >> you’re OK with it.
>> >>
>> >> -Jordan
>> >>
>> >>> On Jun 5, 2016, at 6:42 PM, Cameron McKenzie <mckenzie.cam@gmail.com>
>> >> wrote:
>> >>>
>> >>> hey Jordan,
>> >>> The fix for CURATOR-335 looks good to me, but I'm wondering if it
>> should
>> >>> actually be applied against master and then merged into 3.0?
>> >>>
>> >>> On Fri, Jun 3, 2016 at 12:21 PM, Jordan Zimmerman <
>> >>> jordan@jordanzimmerman.com> wrote:
>> >>>
>> >>>> no worries - get well.
>> >>>>
>> >>>>> On Jun 2, 2016, at 9:20 PM, Cameron McKenzie <
>> mckenzie.cam@gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> Thanks for sorting this out Jordan. I'm pretty sick today so
won't
>> get
>> >>>>> around to looking at it, but I will try over the weekend or
really
>> next
>> >>>> week
>> >>>>> On 3 Jun 2016 7:05 AM, "Jordan Zimmerman" <
>> jordan@jordanzimmerman.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>>> It sounds like curator is using a different algorithm
since it has
>> >>>>>>> nodes sorting their position to determine if they have
a lease or
>> >> not.
>> >>>>>>
>> >>>>>> No - I just added that as I thought there was a bug. But,
now I
>> >> realize
>> >>>>>> I’m wrong. So, it was correct all along. Thanks Ben.
>> >>>>>>
>> >>>>>> -Jordan
>> >>>>
>> >>>>
>> >>
>> >>
>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message