curator-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: CURATOR-3.0 tests
Date Fri, 10 Jun 2016 06:34:46 GMT
Sorry - I’ve been tied up. I’ll take a look tomorrow.

-Jordan

> On Jun 8, 2016, at 5:33 PM, Cameron McKenzie <mckenzie.cam@gmail.com> wrote:
> 
> Does anyone have any thoughts on this?
> 
> Either,
> -I'm misinterpreting something, and we just need to increase the wait time
> / decrease the session timeout on the CURATOR-335 test.
> or
> -The way that the new connection state manager works is incorrect and needs
> to be fixed.
> 
> I'm leaning to the second one. My expectation would be that a LOST event
> would be generated after session timeout MS had passed with no connection
> to ZK. Not 4/3 of the session timeout MS.
> 
> 
> 
> On Tue, Jun 7, 2016 at 5:13 PM, Cameron McKenzie <mckenzie.cam@gmail.com>
> wrote:
> 
>> I think that the problem is that the processEvents() loop in the
>> ConnectionStateManager class checks for timeouts while in a suspended state
>> only every 2/3 of the negotiated session length.
>> 
>> I think that the test in CURATOR-335 is perhaps a bit strange in that it
>> uses the same session timeout as the amount of time it will wait for the
>> session to timeout to occur (+ a sleepForABitCall()), so there's not much
>> room for error. That's why it has exposed this issue.
>> 
>> The session timeout is set to 50 seconds.
>> 
>> The initial SUSPEND event occurs.
>> Then 50 * (2/3) seconds later the processEvents() loop wakes up again
>> after no events have occurred, so checks timeouts. It still hasn't timed
>> out, so it polls again for another 50 * (2/3) seconds. There are no
>> additional events in this time, which means that the next timeout check
>> doesn't occur until 66 seconds (100 * (2/3)) instead of after 50 seconds as
>> you would expect.
>> 
>> If I set the assertions to wait for up to 70 seconds for the appropriate
>> state to be achieved, then the test passes fine.
>> 
>> The poll call in the processEvents loop must take into account how much
>> time has already been spent in a suspended state.
>> 
>> Jordan, do you remember what the rationale behind only waiting for 2/3 of
>> the session timeout is? It's something to do with the way that ZK itself
>> handles session timeouts isn't it? Does ZK timeout the session if it hasn't
>> received a heartbeat for 2/3 of the session timeout? I can't remember.
>> cheers
>> 
>> 
>> 
>> 
>> On Tue, Jun 7, 2016 at 10:41 AM, Cameron McKenzie <mckenzie.cam@gmail.com>
>> wrote:
>> 
>>> Seems like I have uncovered another problem on the 3.0 branch.
>>> 
>>> It looks like the new (ish) connection handling stuff doesn't seem to be
>>> working correctly for long session timeouts. Specifically, the test for
>>> CURATOR-335 fails on the 3.0 branch when run with the new connection
>>> handling, but works with the 'classic' connection handling.
>>> 
>>> It fails when asserting that the LOST event occurs after the server is
>>> stopped.
>>> 
>>> I'm not going to have time to do much more digging for at least today,
>>> but I have made a more targeted test case:
>>> 
>>> TestFramework:testSessionLossWithLongTimeout on
>>> the long_session_timeout_issue branch.
>>> 
>>> if anyone has time to look before I do.
>>> 
>>> I think that this needs to be resolved before 3.0 can be released.
>>> cheers
>>> 
>>> 
>>> On Mon, Jun 6, 2016 at 9:49 AM, Jordan Zimmerman <
>>> jordan@jordanzimmerman.com> wrote:
>>> 
>>>> :D
>>>> 
>>>>> Is it worth holding up the build to merge CURATOR-331?
>>>> No, let’s go with what we have.
>>>> 
>>>>> On Jun 5, 2016, at 6:48 PM, Cameron McKenzie <mckenzie.cam@gmail.com>
>>>> wrote:
>>>>> 
>>>>> Ah, must still be recovering, I'm sure I saw it was being applied to
>>>> the
>>>>> 3.0 branch.
>>>>> 
>>>>> I will merge it into master and 3.0.
>>>>> 
>>>>> Is it worth holding up the build to merge CURATOR-331? I have asked
>>>> Scott
>>>>> what his opinion is since its the TreeCache stuff. It looks ok to me
>>>> though.
>>>>> cheers
>>>>> 
>>>>> On Mon, Jun 6, 2016 at 9:44 AM, Jordan Zimmerman <
>>>> jordan@jordanzimmerman.com
>>>>>> wrote:
>>>>> 
>>>>>> Yes, that’s correct. It’s a patch against master. I’ll do the
merge if
>>>>>> you’re OK with it.
>>>>>> 
>>>>>> -Jordan
>>>>>> 
>>>>>>> On Jun 5, 2016, at 6:42 PM, Cameron McKenzie <mckenzie.cam@gmail.com
>>>>> 
>>>>>> wrote:
>>>>>>> 
>>>>>>> hey Jordan,
>>>>>>> The fix for CURATOR-335 looks good to me, but I'm wondering if
it
>>>> should
>>>>>>> actually be applied against master and then merged into 3.0?
>>>>>>> 
>>>>>>> On Fri, Jun 3, 2016 at 12:21 PM, Jordan Zimmerman <
>>>>>>> jordan@jordanzimmerman.com> wrote:
>>>>>>> 
>>>>>>>> no worries - get well.
>>>>>>>> 
>>>>>>>>> On Jun 2, 2016, at 9:20 PM, Cameron McKenzie <
>>>> mckenzie.cam@gmail.com>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Thanks for sorting this out Jordan. I'm pretty sick today
so won't
>>>> get
>>>>>>>>> around to looking at it, but I will try over the weekend
or really
>>>> next
>>>>>>>> week
>>>>>>>>> On 3 Jun 2016 7:05 AM, "Jordan Zimmerman" <
>>>> jordan@jordanzimmerman.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>>> It sounds like curator is using a different algorithm
since it
>>>> has
>>>>>>>>>>> nodes sorting their position to determine if
they have a lease or
>>>>>> not.
>>>>>>>>>> 
>>>>>>>>>> No - I just added that as I thought there was a bug.
But, now I
>>>>>> realize
>>>>>>>>>> I’m wrong. So, it was correct all along. Thanks
Ben.
>>>>>>>>>> 
>>>>>>>>>> -Jordan
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>>> 
>> 


Mime
View raw message