geode-dev mailing list archives

From Owen Nichols <onich...@pivotal.io>
Subject Re: I propose including the fix for GEODE-3780 in 1.10
Date Sat, 17 Aug 2019 09:07:03 GMT
Hi Bruce, thank you for raising your concern.

Geode's release process dictates a time-based schedule <https://cwiki.apache.org/confluence/display/GEODE/Release+Schedule>
to cut release branches.  The release/1.10.0 <https://github.com/apache/geode/tree/release/1.10.0>
branch was already cut 2 weeks ago, but the “critical fixes” rule does allow critical
fixes to be brought to the release branch by proposal on the dev list.

If there is consensus from the Geode community that your proposed fix satisfies the “critical
fixes” rule, I will be happy to bring it to the 1.10.0 release branch.

Regards,
- Owen

> On Aug 15, 2019, at 3:38 PM, Bruce Schuchardt <bschuchardt@pivotal.io> wrote:
> 
> In this case it was another change that is in 1.10 that decreased the amount of time
> we try to connect to unreachable alert listeners that caused this problem to resurface.  This
> decrease allowed availability checks to proceed faster than they used to. This allowed an
> availability check to pass and on subsequent suspect initiation we did not process the suspect
> event locally, causing the node that should have become coordinator (and declared a network
> partition) to just loop endlessly casting suspicion on other nodes but not doing anything
> about it.
> 
> So, "yes", we do know what caused it to resurface and that change is only in 1.10.  GEODE-3780
> was not correctly fixed before and this 1.10 change made it more likely to occur.
> 
> On 8/15/19 3:03 PM, Udo Kohlmeyer wrote:
>> Looking at the Geode ticket number, it seems this problem has resurfaced, as it seems
>> to have been addressed in 1.7.0 already.
>> 
>> My concern is: do we know WHAT caused it to resurface? Or was this issue always
>> dormant and only recently resurfaced? Not understanding why we are seeing "fixed" issues
>> resurfacing concerns me, as that could mean we have made changes that have adverse effects
>> and that we were really premature in cutting 1.10.
>> 
>> --Udo
>> 
>> On 8/15/19 2:46 PM, Bruce Schuchardt wrote:
>>> Testing in the past week hit this problem 9 times and it was identified as a
>>> new issue.
>>> 
>>> 
>>> On 8/15/19 2:23 PM, Jacob Barrett wrote:
>>>> Because someone will ask, can we be proactive in these requests by identifying
>>>> whether the issue being fixed was introduced in Geode 1.10 or is a preexisting condition?
>>>> 
>>>> -jake
>>>> 
>>>> 
>>>>> On Aug 15, 2019, at 2:09 PM, Bruce Schuchardt <bschuchardt@pivotal.io> wrote:
>>>>> 
>>>>> This is a fix for a problem where a member that has lost quorum does
>>>>> not detect it and does not shut down.  The fix is small and has been extensively tested.
>>>>> The fix also addresses the possibility of a member being kicked out of the cluster when it
>>>>> is only late in delivering a heartbeat (i.e., no availability check performed).
>>>>> 
>>>>> SHA: 8e9b04470264983d0aa1c7900f6e9be2374549d9
>>>>> 