geode-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan McMahon <mcmellaw...@apache.org>
Subject Re: Proposal to include GEODE-7088 and GEODE-7089 in 1.10.0
Date Tue, 27 Aug 2019 02:50:27 GMT
The cherry-pick for GEODE-7088 is clean so I didn't open a PR for that
one.  The cherry-pick for GEODE-7089 required manual merging due to several
unrelated stats changes added to develop recently.  The PR to merge that
one into release/1.10.0 is here:
https://github.com/apache/geode/pull/3976

The original PR for this change when it was added to develop is here:
https://github.com/apache/geode/pull/3929

Thanks all for reading and considering.

Ryan

On Mon, Aug 26, 2019 at 4:22 PM Udo Kohlmeyer <udo@apache.com> wrote:

> Thank you Ryan,
>
> +1 for inclusion
>
> On 8/26/19 3:33 PM, Ryan McMahon wrote:
> > Udo,
> >
> > Here are inline answers to your questions:
> >
> > *Is this an existing issue?*
> >
> > Short answer - yes, but it has never been in a release version of Geode.
> > The leak was introduced as part of some changes to address handling
> > multiple concurrent registration requests for a given client on a single
> > server.  The issue is only seen if client registration fails which is not
> > particularly common, which is why we are only seeing it now after months
> of
> > testing.  The commit for that was introduced here on April 30th.
> >
> https://github.com/apache/geode/commit/bc2a2fa5af374cfedfba4dc1abe6cbc2a7b719c8
> > The ConcurrentModificationException issue (which ultimately causes the
> > registration to fail) was introduced on April 22nd here:
> >
> https://github.com/apache/geode/commit/afc311c04f6908a8f725834cdf2c49ce6e902b3f
> >
> >
> > *Why is it more critical to squeeze it into an existing (almost
> > release) version of Apache Geode?*
> >
> > Not sure I totally understand this question, but it is critical because
> the
> > leak can cause servers to crash due to OOM.  Again, because the problems
> > were introduced in late April and we haven't released Geode since then,
> so
> > I think it is very important to merge these fixes into 1.10.0 before we
> > release.
> >
> >
> >
> > *What guarantees do we have that this fix makes the application more
> stable
> > compared to adding another hidden issue, which we will discover in
> another
> > few weeks from now?*
> >
> > I added numerous tests for this scenario to ensure that the leak would
> > never happen regardless of the cause of a failed client registration.
> > There obviously is no way to 100% guarantee that there will be no issues
> > that arise from the fixes themselves, but our existing test coverage plus
> > the new tests I wrote should give us the confidence we need.
> >
> > Thanks,
> > Ryan
> >
> > On Mon, Aug 26, 2019 at 3:17 PM Udo Kohlmeyer <udo@apache.com> wrote:
> >
> >> In order to better understand this request:
> >>
> >> Is this an existing issue?
> >>
> >> Why is it more critical to squeeze it into an existing (almost release)
> >> version of Apache Geode?
> >>
> >> What guarantees do we have that this fix makes the application more
> >> stable compared to adding another hidden issue, which we will discover
> >> in another few weeks from now?
> >>
> >>
> >> --Udo
> >>
> >> On 8/26/19 3:10 PM, Ryan McMahon wrote:
> >>> Hi all,
> >>>
> >>> I would like to propose cherry-picking GEODE-7088 and GEODE-7089 to the
> >>> 1.10.0 release branch.  The two JIRAs are related to the same root
> >> problem,
> >>> which I would classify as critical.  We discovered a case where a
> failed
> >>> client registration could lead to a memory leak in a server, eventually
> >>> causing the server to crash due to lack of memory.
> >>>
> >>> The issue is instigated by a ConcurrentModificationException due to
> >>> iteration of a non-thread safe collection while it is being mutated
> >>> (GEODE-7088).  This exception occurs when the client's queue image is
> >> being
> >>> copied from one server to the next during client registration, and it
> >>> causes the client's registration to fail.  The client would likely
> >> succeed
> >>> if it retried registration with that same server, but if it registers
> >> with
> >>> a different server, we end up leaking events to the client's
> registration
> >>> queue on the original server (GEODE-7089).
> >>>
> >>> The fix for GEODE-7088 is to use thread-safe collections for interested
> >>> clients in client update messages.  The fix for GEODE-7089 is to always
> >>> drain and remove the registration queue regardless of success or
> failure.
> >>> Together, these fixes prevent the failed registrations and memory leak.
> >>>
> >>> The SHAs for the fixes and tests in develop are:
> >>>
> >>> GEODE-7088
> >>> - 174af1d23fb7e09eb2bc2fa55479df854850fadb
> >>> - 5bb753a8f4ff2886acd8e62d6f51fea58e37881d
> >>>
> >>> GEODE-7089
> >>> - 5d0153ad4adb1612a1083673f98b1982819a6589
> >>>
> >>> This proposal is to cherry-pick these commits to 1.10.0 release branch.
> >>>
> >>> Thanks,
> >>> Ryan McMahon
> >>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message