geode-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiaojian Zhou <gz...@pivotal.io>
Subject Re: 2 minute gateway startup time due to GEODE-5591
Date Wed, 05 Sep 2018 17:43:23 GMT
The fix intend to resolve 2 issues:
1) change the exception handling (for a linux version).
2) prevent random picking port number to loop forever. In old code, for
example, if the range only contains one port, random will always pick the
same port and it will loop forever. The fix will stop after all available
ports in the range are tried. There's a test

test_ValidateGatewayReceiverAttributes_WrongBindAddress


For 2-minute-wait, it's still possible. The fix did not resolve it
(when random() happened to return same port for different receiver in
the same member), but I did not make things worse either.


There's discussion on if we can reduce the 2-minute-timeout to a few
second. This is definitely another ticket.

Regards

Gester


On Wed, Sep 5, 2018 at 10:35 AM, Anthony Baker <abaker@pivotal.io> wrote:

> Before this improvement is re-merged I’d like to see:
>
> 1) A test that characterizes the current behavior (e.g. doesn’t wait 2 min
> when there’s a port conflict)
> 2) A test that demonstrates how the current logic is insufficient
>
> Anthony
>
>
> > On Sep 5, 2018, at 10:20 AM, Nabarun Nag <nnag@apache.org> wrote:
> >
> > GEODE-5591 has been reverted in develop
> > ref: 901da27f227a8ce2b7d6b681619782a1accd9330
> >
> > Regards
> > Nabarun Nag
> >
> > On Wed, Sep 5, 2018 at 10:14 AM Ryan McMahon <rmcmahon@pivotal.io>
> wrote:
> >
> >> +1 for reverting in both places.
> >>
> >> I see that there is already an isGatewayReceiver flag in the
> AcceptorImpl
> >> constructor.  It's not ideal, but could we use this flag to prevent the
> 2
> >> minute retry logic for happening if this flag is true?
> >>
> >> Ryan
> >>
> >> On Wed, Sep 5, 2018 at 10:01 AM, Lynn Hughes-Godfrey <
> >> lhughesgodfrey@pivotal.io> wrote:
> >>
> >>> +1 for reverting in both places.
> >>>
> >>> On Wed, Sep 5, 2018 at 9:50 AM, Dan Smith <dsmith@pivotal.io> wrote:
> >>>
> >>>> +1 for reverting in both places. The current fix is not better, that's
> >>> why
> >>>> we are reverting it on the release branch!
> >>>>
> >>>> -Dan
> >>>>
> >>>> On Wed, Sep 5, 2018 at 9:47 AM, Jacob Barrett <jbarrett@pivotal.io>
> >>> wrote:
> >>>>
> >>>>> I’m not ok with reverting in develop. Revert in 1.7 and modify
in
> >>>> develop.
> >>>>> We shouldn’t go backwards in develop. The current fix is better
than
> >>> the
> >>>>> bug it fixes.
> >>>>>
> >>>>>> On Sep 5, 2018, at 9:40 AM, Nabarun Nag <nnag@apache.org>
wrote:
> >>>>>>
> >>>>>> If everyone is okay with it, I will revert that change in develop
> >> and
> >>>>> then
> >>>>>> cherry pick it to release/1.7.0 branch.
> >>>>>> Please do comment.
> >>>>>>
> >>>>>> Regards
> >>>>>> Nabarun Nag
> >>>>>>
> >>>>>>
> >>>>>>> On Wed, Sep 5, 2018 at 9:30 AM Dan Smith <dsmith@pivotal.io>
> >> wrote:
> >>>>>>>
> >>>>>>> +1 to yank it and rework the fix.
> >>>>>>>
> >>>>>>> Gester's change helps, but it just means that you will sometimes
> >>>>> randomly
> >>>>>>> have a 2 minute delay starting up a gateway receiver. I
don't
> >> think
> >>>>> that is
> >>>>>>> a great user experience either.
> >>>>>>>
> >>>>>>> -Dan
> >>>>>>>
> >>>>>>> On Wed, Sep 5, 2018 at 8:20 AM, Bruce Schuchardt <
> >>>>> bschuchardt@pivotal.io>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Let's yank it
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On 9/4/18 5:04 PM, Sean Goller wrote:
> >>>>>>>>>
> >>>>>>>>> If it's to get the release out, I'm fine with reverting.
I don't
> >>>> like
> >>>>>>> it,
> >>>>>>>>> but I'm not willing to die on that hill. :)
> >>>>>>>>>
> >>>>>>>>> -S.
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 4, 2018 at 4:38 PM Dan Smith <dsmith@pivotal.io>
> >>> wrote:
> >>>>>>>>>
> >>>>>>>>> Spitting this into a separate thread.
> >>>>>>>>>>
> >>>>>>>>>> I see the issue. The two minute timeout is the
constructor for
> >>>>>>>>>> AcceptorImpl, where it retries to bind for 2
minutes.
> >>>>>>>>>>
> >>>>>>>>>> That behavior makes sense for CacheServer.start.
> >>>>>>>>>>
> >>>>>>>>>> But it doesn't make sense for the new logic
in
> >>>>> GatewayReceiver.start()
> >>>>>>>>>> from
> >>>>>>>>>> GEODE-5591. That code is trying to use CacheServer.start
to
> >> scan
> >>>> for
> >>>>> an
> >>>>>>>>>> available port, trying each port in a range.
That free port
> >>> finding
> >>>>>>> logic
> >>>>>>>>>> really doesn't want to have two minutes of retries
for each
> >> port.
> >>>> It
> >>>>>>>>>> seems
> >>>>>>>>>> like we need to rework the fix for GEODE-5591.
> >>>>>>>>>>
> >>>>>>>>>> Does it make sense to hold up the release to
rework this fix,
> >> or
> >>>>> should
> >>>>>>>>>> we
> >>>>>>>>>> just revert it? Have we switched concourse over
to using alpine
> >>>>> linux,
> >>>>>>>>>> which I think was the original motivation for
this fix?
> >>>>>>>>>>
> >>>>>>>>>> -Dan
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Sep 4, 2018 at 4:25 PM, Dan Smith <dsmith@pivotal.io>
> >>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Why is it waiting at all in this case? Where
is this 2 minute
> >>>> timeout
> >>>>>>>>>>> coming from?
> >>>>>>>>>>>
> >>>>>>>>>>> -Dan
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Sep 4, 2018 at 4:12 PM, Sai Boorlagadda
<
> >>>>>>>>>>>
> >>>>>>>>>> sai.boorlagadda@gmail.com
> >>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>> So the issue is that it takes longer
to start than previous
> >>>>> releases?
> >>>>>>>>>>>> Also, is this wait time only when using
Gfsh to create
> >>>>>>>>>>>> gateway-receiver?
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Sep 4, 2018 at 4:03 PM Nabarun
Nag <nnag@apache.org>
> >>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Currently we have a minor issue in the
release branch as
> >>> pointed
> >>>>> out
> >>>>>>>>>>>>>
> >>>>>>>>>>>> by
> >>>>>>>>>>
> >>>>>>>>>>> Barry O.
> >>>>>>>>>>>>> We will wait till a resolution is
figured out for this
> >> issue.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Steps:
> >>>>>>>>>>>>> 1. create locator
> >>>>>>>>>>>>> 2. start server --name=server1 --server-port=40404
> >>>>>>>>>>>>> 3. start server --name=server2 --server-port=40405
> >>>>>>>>>>>>> 4. create gateway-receiver --member=server1
> >>>>>>>>>>>>> 5. create gateway-receiver --member=server2
`This gets stuck
> >>>> for 2
> >>>>>>>>>>>>>
> >>>>>>>>>>>> minutes`
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Is the 2 minute wait time acceptable?
Should we document it?
> >>>> When
> >>>>> we
> >>>>>>>>>>>>>
> >>>>>>>>>>>> revert
> >>>>>>>>>>>>
> >>>>>>>>>>>>> GEODE-5591, this issue does not
happen.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards
> >>>>>>>>>>>>> Nabarun Nag
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message