geode-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Udo Kohlmeyer <ukohlme...@gmail.com>
Subject Re: Proposal to Include GEODE-7079 in 1.10.0
Date Thu, 15 Aug 2019 19:58:26 GMT
@Dan, not arguing that logs filling up with NPE's could bring a system 
down with limit disk space, or potentially swallowing important logs 
that could be helpful in root-causing issues...

I'm merely raising the question on why this bug fix should receive 
priority inclusion. It has been around for a long time and could be 
included in a subsequent release.

On 8/15/19 12:34 PM, Dan Smith wrote:
> +1 to merging Juan's fix for GEODE-7079. I've seen systems taken down by
> rapidly filling up the logs in the past, this does seem to be a critical
> fix from the perspective of system stability.
>
> Also, this is the change, which doesn't seem particularly risky to me.
>
> -          ConflationKey key = new
> ConflationKey(gsEvent.getRegion().getFullPath(),
> +          ConflationKey key = new ConflationKey(gsEvent.getRegionPath(),
>
> -Dan
>
> On Thu, Aug 15, 2019 at 12:23 PM Udo Kohlmeyer <udo@apache.com> wrote:
>
>> Whilst I agree with "*finish* when we believe the quality of the release
>> branch is sufficient", I disagree that we should have cut a branch and
>> continue to patch that branch with non-critical fixes. i.e this issue
>> has been around for a while and has no averse side effects. Issues like
>> GEODE-7081, which is new due to a new commit, AND it has critical
>> stability implications on the server, that I can agree we should include
>> in a potential release branch.
>>
>> Otherwise we can ALWAYS argue that said release branch is not of
>> "sufficient" quality, especially if there are numerous existing JIRA's
>> pertaining to bugs already in the system.
>>
>> To quote Juan's original email:
>>
>> /"Note: *no events are lost (even without the fix)* but, if the region
>> takes//
>> //a while to recover, the logs  for the member can grow pretty quickly
>> due to//
>> //the continuously thrown *NPEs.*"/
>>
>> In addition to this, if there is a commit in a cut release branch, which
>> is requiring us to continuously patching the release branch, in order to
>> stabilize that feature/fix, maybe we should consider reverting that fix
>> and release it at a later stage, when it is believed that this fix is
>> more stable and have better, more comprehensive test coverage.
>>
>> So far, GEODE-7081, does not have me convinced that it is critical. OR
>> maybe it is the latter of my options, where it is a stabilization commit
>> to a new feature, which begs the question, should we have accepted the
>> original feature commit if there are all manner of side effects which we
>> are only discovering.
>>
>> --Udo
>>
>> On 8/15/19 11:08 AM, Anthony Baker wrote:
>>> While we can’t fix *all known bugs*, I think where we do have a fix for
>> an important issue we should think hard about the cost of not including
>> that in a release.
>>> IMO, the fixed time approach to releases means that we *start* the
>> release effort (including stabilization and bug fixing if needed) on a
>> known date and we *finish* when new believe the quality of the release
>> branch is sufficient.  Given the number of important fixes being requested,
>> I’m not sure we are there yet.
>>> I think the release branch concept has merit because it allows us to
>> isolate ongoing work from the changes needed for a release.
>>> +1 for including GEODE-7079.
>>>
>>> Anthony
>>>
>>>
>>>> On Aug 15, 2019, at 10:51 AM, Udo Kohlmeyer <ukohlmeyer@gmail.com>
>> wrote:
>>>> Seems everyone is in favor or including a /*non-critical*/ fix to an
>> already cut branch of the a potential release...
>>>> Am I missing something?
>>>>
>>>> Why cut a release at all... just have a perpetual cycle of fixes added
>> to develop and users can chose what nightly snapshot build they would want
>> to use..
>>>> I'm voting -1 on a non-critical issue, which is existing and worst
>> effect is to fill logs will NPE logs... (yes, not something we want).
>>>> I believed that we (as a Geode community) agreed that once a release
>> has been cut, only critical issue fixes will be included. If we continue
>> just continually adding to the ALREADY CUT 1.10 release, where do we stop
>> and when do we release...
>>>> --Udo
>>>>
>>>> On 8/15/19 10:19 AM, Nabarun Nag wrote:
>>>>> +1
>>>>>
>>>>> On Thu, Aug 15, 2019 at 10:15 AM Alexander Murmann <
>> amurmann@apache.org>
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Agreed to fixing this. It's impossible for a user to discover they
>> hit an
>>>>>> edge case that we fail to support till they are in prod and restart.
>>>>>>
>>>>>> On Thu, Aug 15, 2019 at 10:09 AM Juan José Ramos <jramos@pivotal.io>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Udo,
>>>>>>>
>>>>>>> Even if it is an existing issue I'd still consider it critical
for
>> those
>>>>>>> cases on which there are unprocessed events on the persistent
queue
>>>>>> after a
>>>>>>> restart and the region takes long to recover... you can actually
see
>>>>>>> millions of *NPEs* flooding the member's logs.
>>>>>>> My two cents anyway, it's up to the community to make the final
>> decision.
>>>>>>> Cheers.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Aug 15, 2019 at 5:58 PM Udo Kohlmeyer <udo@apache.com>
>> wrote:
>>>>>>>> Juan,
>>>>>>>>
>>>>>>>>    From your explanation, it seems this issue is existing
and not
>>>>>>>> critical. Could we possibly hold this for 1.11?
>>>>>>>>
>>>>>>>> --Udo
>>>>>>>>
>>>>>>>> On 8/15/19 5:29 AM, Ju@N wrote:
>>>>>>>>> Hello team,
>>>>>>>>>
>>>>>>>>> I'd like to propose including the *fix [1]* for *GEODE-7079
[2]* in
>>>>>>>> release
>>>>>>>>> 1.10.0.
>>>>>>>>> Long story short: a *NullPointerException* can be continuously
>> thrown
>>>>>>>>> and flood the member's logs if a serial event processor
(either
>>>>>>>>> *async-event-queue* or *gateway-sender*) starts processing
events
>>>>>> from
>>>>>>> a
>>>>>>>>> recovered persistent queue before the actual region to
which it was
>>>>>>>>> attached is fully operational.
>>>>>>>>> Note: *no events are lost (even without the fix)* but,
if the
>> region
>>>>>>>> takes
>>>>>>>>> a while to recover, the logs  for the member can grow
pretty
>> quickly
>>>>>>> due
>>>>>>>> to
>>>>>>>>> the continuously thrown *NPEs.*
>>>>>>>>> Best regards.
>>>>>>>>>
>>>>>>>>> [1]:
>>>>>>>>>
>> https://github.com/apache/geode/commit/6f4bbbd96bcecdb82cf7753ce1dae9fa6baebf9b
>>>>>>>>> [2]: https://issues.apache.org/jira/browse/GEODE-7079
>>>>>>>>>
>>>>>>> --
>>>>>>> Juan José Ramos Cassella
>>>>>>> Senior Software Engineer
>>>>>>> Email: jramos@pivotal.io
>>>>>>>

Mime
View raw message