geode-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Udo Kohlmeyer <...@apache.com>
Subject Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0
Date Mon, 26 Aug 2019 23:22:32 GMT
Thank you Ryan,

+1 for inclusion

On 8/26/19 3:33 PM, Ryan McMahon wrote:
> Udo,
>
> Here are inline answers to your questions:
>
> *Is this an existing issue?*
>
> Short answer - yes, but it has never been in a release version of Geode.
> The leak was introduced as part of some changes to address handling
> multiple concurrent registration requests for a given client on a single
> server.  The issue is only seen if client registration fails which is not
> particularly common, which is why we are only seeing it now after months of
> testing.  The commit for that was introduced here on April 30th.
> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
> The ConcurrentModificationException issue (which ultimately causes the
> registration to fail) was introduced on April 22nd here:
> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f
>
>
> *Why is it more critical to squeeze it into an existing (almost
> release) version of Apache Geode?*
>
> Not sure I totally understand this question, but it is critical because the
> leak can cause servers to crash due to OOM.  Again, because the problems
> were introduced in late April and we haven't released Geode since then, so
> I think it is very important to merge these fixes into 1.10.0 before we
> release.
>
>
>
> *What guarantees do we have that this fix makes the application more stable
> compared to adding another hidden issue, which we will discover in another
> few weeks from now?*
>
> I added numerous tests for this scenario to ensure that the leak would
> never happen regardless of the cause of a failed client registration.
> There obviously is no way to 100% guarantee that there will be no issues
> that arise from the fixes themselves, but our existing test coverage plus
> the new tests I wrote should give us the confidence we need.
>
> Thanks,
> Ryan
>
> On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <udo@apache.com> wrote:
>
>> In order to better understand this request:
>>
>> Is this an existing issue?
>>
>> Why is it more critical to squeeze it into an existing (almost release)
>> version of Apache Geode?
>>
>> What guarantees do we have that this fix makes the application more
>> stable compared to adding another hidden issue, which we will discover
>> in another few weeks from now?
>>
>>
>> --Udo
>>
>> On 8/26/19 3:10 PM, Ryan McMahon wrote:
>>> Hi all,
>>>
>>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
>>> 1.10.0 release branch.  The two JIRAs are related to the same root
>> problem,
>>> which I would classify as critical.  We discovered a case where a failed
>>> client registration could lead to a memory leak in a server, eventually
>>> causing the server to crash due to lack of memory.
>>>
>>> The issue is instigated by a ConcurrentModificationException due to
>>> iteration of a non-thread safe collection while it is being mutated
>>> (GEODE-7088).  This exception occurs when the client's queue image is
>> being
>>> copied from one server to the next during client registration, and it
>>> causes the client's registration to fail.  The client would likely
>> succeed
>>> if it retried registration with that same server, but if it registers
>> with
>>> a different server, we end up leaking events to the client's registration
>>> queue on the original server (GEODE-7089).
>>>
>>> The fix for GEODE-7088 is to use thread-safe collections for interested
>>> clients in client update messages.  The fix for GEODE-7089 is to always
>>> drain and remove the registration queue regardless of success or failure.
>>> Together, these fixes prevent the failed registrations and memory leak.
>>>
>>> The SHAs for the fixes and tests in develop are:
>>>
>>> GEODE-7088
>>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
>>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
>>>
>>> GEODE-7089
>>> - 5d0153ad4adb1612a1083673f98b1982819a6589
>>>
>>> This proposal is to cherry-pick these commits to 1.10.0 release branch.
>>>
>>> Thanks,
>>> Ryan McMahon
>>>

Mime
View raw message