kafka-dev mailing list archives

From Damian Guy <damian....@gmail.com>
Subject Re: [DISCUSS] KIP-134: Delay initial consumer group rebalance
Date Tue, 28 Mar 2017 08:53:10 GMT
All,
I'd like to get this back to the original discussion about delaying the initial
consumer group rebalance.
I think I'm leaning towards sticking with the broker config and changing
the delay so that the timer starts again when a new consumer joins the
group. What are people's thoughts on that?

Doing something similar on leave is valid, but I'd prefer to consider it
separately from this.
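
To make the restart-on-join idea concrete, here is a minimal sketch in Python.
It is a simulation only, not broker code: the function name and the parameters
`delay` and `rebalance_timeout` are hypothetical stand-ins for whatever the
broker config ends up being called, and capping the wait at the rebalance
timeout is an assumption taken from Becket's suggestion further down the thread.

```python
def rebalance_time(join_times, delay, rebalance_timeout):
    """Return the time (seconds) at which the initial rebalance would fire.

    join_times: sorted arrival times of members joining an empty group.
    delay: quiet period that restarts on every join (hypothetical config).
    rebalance_timeout: upper bound on the wait, measured from the first join.
    """
    if not join_times:
        raise ValueError("at least one member must join")
    first = join_times[0]
    hard_deadline = first + rebalance_timeout
    deadline = min(first + delay, hard_deadline)
    for t in join_times[1:]:
        if t >= deadline:
            # The rebalance has already fired; this member joins afterwards
            # and would trigger a separate, later rebalance.
            break
        # A join inside the window restarts the timer, capped by the timeout.
        deadline = min(t + delay, hard_deadline)
    return deadline
```

With a 2 second delay, members arriving at t=0, 1 and 2.5 would all be covered
by one rebalance, while a member arriving at t=5 after a lone first join would
miss the window and cause a second rebalance.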

Thanks,
Damian

On Tue, 28 Mar 2017 at 09:48 Damian Guy <damian.guy@gmail.com> wrote:

> Matthias,
>
> Yes, I know.
>
> Thanks,
> Damian
>
> On Mon, 27 Mar 2017 at 18:17 Matthias J. Sax <matthias@confluent.io>
> wrote:
>
> Damian,
>
> about "rebalance immediately" on timeout -- I guess, that's a different
> case as no LeaveGroupRequest will be sent. Thus, the broker should be
> able to distinguish both cases easily, and apply the delay only if it
> received the LeaveGroupRequest but not if a consumer times out.
>
> Does this make sense?
>
> -Matthias
>
> On 3/27/17 1:56 AM, Damian Guy wrote:
> > @Becket
> >
> > Thanks for the feedback. Yes, I like the idea of extending the delay as
> > each new consumer joins the group. Though, I think this could be done
> > with either a consumer or broker side config. But I get your point that
> > some consumers in the group can be misconfigured.
> >
> > @Matthias & @Eno - yes we could probably do something similar if the
> > member has sent the LeaveGroupRequest. I'm not sure it would be valid if
> > the member crashed: in that case session.timeout would come into play and
> > we'd probably want to rebalance immediately. I'd be interested in hearing
> > thoughts from other core Kafka folks on this one.
> >
> > Thanks,
> > Damian
> >
> >
> >
> > On Fri, 24 Mar 2017 at 23:01 Becket Qin <becket.qin@gmail.com> wrote:
> >
> >> Hi Matthias,
> >>
> >> Yes, that was what I was thinking. We will keep delaying it until either
> >> the rebalance timeout is reached or no new consumer joins within that
> >> small delay, which is configured on the broker side.
> >>
> >> Thanks,
> >>
> >> Jiangjie (Becket) Qin
> >>
> >> On Fri, Mar 24, 2017 at 1:39 PM, Matthias J. Sax <matthias@confluent.io>
> >> wrote:
> >>
> >>> @Becket:
> >>>
> >>> I am not sure if I understand this correctly. Instead of applying a
> >>> fixed delay that starts when the first consumer of an (empty) group
> >>> joins, you suggest re-triggering/resetting the delay each time a new
> >>> consumer joins?
> >>>
> >>> This sounds like a good strategy to me, if the config is on the broker
> >>> side.
> >>>
> >>> @Eno:
> >>>
> >>> I think that's a valid point and I like this idea!
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 3/24/17 1:23 PM, Eno Thereska wrote:
> >>>> Thanks Damian,
> >>>>
> >>>> This KIP deals with the initial phase only. What about the cases when
> >>>> several consumers leave a group? Won't there be several expensive
> >>>> rebalances then as well? I'm wondering if it makes sense for the delay
> >>>> to hold anytime the "set" of consumers in a group changes, be it
> >>>> addition to the group or removal from the group.
> >>>>
> >>>> Thanks
> >>>> Eno
> >>>>
> >>>>
> >>>>> On 24 Mar 2017, at 20:04, Becket Qin <becket.qin@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> Thanks for the KIP, Damian.
> >>>>>
> >>>>> My two cents on this. It seems there are two things worth thinking
> >>>>> about here:
> >>>>>
> >>>>> 1. Better rebalance timing. We will try to rebalance only when all the
> >>>>> consumers in a group have joined. The challenge would be that someone
> >>>>> has to define what ALL consumers means; it could either be a time or
> >>>>> a number of consumers, etc.
> >>>>>
> >>>>> 2. Avoid frequent rebalances. For example, if there are 100 consumers
> >>>>> in a group, today, in the worst case, we may end up with 100
> >>>>> rebalances even if all the consumers joined the group in a reasonably
> >>>>> small amount of time. Frequent rebalances are also a bad thing for
> >>>>> brokers.
> >>>>>
> >>>>> Having a client side configuration may solve problem 1 better because
> >>>>> each consumer group can potentially configure its own timing. However,
> >>>>> it does not really prevent frequent rebalances in general because some
> >>>>> of the consumers can be misconfigured. (This may have something to do
> >>>>> with KIP-124 as well. But if a quota is applied on the
> >>>>> JoinGroup/SyncGroup request it may cause some unwanted cascading
> >>>>> effects.)
> >>>>>
> >>>>> Having a broker side configuration may result in less flexibility for
> >>>>> each consumer group, but it can prevent frequent rebalances better. I
> >>>>> think with some reasonable design, the rebalance timing issue can be
> >>>>> resolved on the broker side as well. Matthias had a good point on
> >>>>> extending the delay when a new consumer joins a group (we actually did
> >>>>> something similar to batch ISR change propagation). For example, let's
> >>>>> say on the broker side, we will always delay 2 seconds each time we
> >>>>> see a new consumer joining a consumer group. This would probably work
> >>>>> for most of the consumer groups and will also limit the rebalance
> >>>>> frequency to protect the brokers.
> >>>>>
> >>>>> I am not sure about the streams use case here, but if something like 2
> >>>>> seconds of delay is acceptable for streams, I would prefer adding the
> >>>>> configuration to the broker so that we can address both problems.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Jiangjie (Becket) Qin
> >>>>>
> >>>>>
> >>>>> On Fri, Mar 24, 2017 at 5:30 AM, Damian Guy <damian.guy@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Thanks for the feedback.
> >>>>>>
> >>>>>> Ewen: I'm happy to make it a client side config. Other than the
> >>>>>> protocol bump I think the effort is almost the same. Personally I see
> >>>>>> no other issues, but based on discussions with others this is what we
> >>>>>> came up with.
> >>>>>>
> >>>>>> True, it can probably be tested easily via an integration test.
> >>>>>>
> >>>>>> Matthias: Yes I agree, the delay could be extended as each new member
> >>>>>> joins the group.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Damian
> >>>>>>
> >>>>>> On Fri, 24 Mar 2017 at 05:14 Ewen Cheslack-Postava <ewen@confluent.io>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I have the same initial response as Ismael re: broker vs consumer
> >>>>>>> settings. The global setting seems questionable.
> >>>>>>>
> >>>>>>> Could we maybe summarize what the impact of making this a client
> >>>>>>> config would be? Protocol bump is obvious, but is there any other
> >>>>>>> significant issue? For the protocol bump in particular, I think this
> >>>>>>> change is currently really critical for streams; it will be valuable
> >>>>>>> elsewhere, but the immediate demand is streams, so a protocol bump
> >>>>>>> while being backwards compatible wouldn't affect any other clients.
> >>>>>>> Is this still actually compatible with different clients given that
> >>>>>>> they would now expect different timeouts? (I think it's strictly
> >>>>>>> compatible if you wait for responses, but if you enforce any client
> >>>>>>> side timeouts, I'm not so sure.)
> >>>>>>>
> >>>>>>> re: test plan, I'm sure this will come as a surprise, but is the
> >>>>>>> system test even necessary? Validating # of rebalances seems messy as
> >>>>>>> other things can cause rebalances (though admittedly not in a "clean"
> >>>>>>> case). But really it seems like an integration test could validate
> >>>>>>> this by making sure only 1 rebalance occurred when 2 members joined
> >>>>>>> with a sufficient time gap.
> >>>>>>>
> >>>>>>> -Ewen
> >>>>>>>
> >>>>>>> On Thu, Mar 23, 2017 at 3:53 PM, Matthias J. Sax <matthias@confluent.io>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks for the KIP Damian!
> >>>>>>>>
> >>>>>>>> My two cents:
> >>>>>>>>
> >>>>>>>> - we should have an explicit parameter for this -- implicit settings
> >>>>>>>> are always tricky (the "importance" of this parameter would be LOW)
> >>>>>>>>
> >>>>>>>> - the config should be different for each consumer group:
> >>>>>>>>   * assume you have a stateless app, you want to rebalance
> >>>>>>>>     immediately
> >>>>>>>>   * if you start up in a virtualized environment using some tools
> >>>>>>>>     like Mesos you might need a different value than on bare metal
> >>>>>>>>     (no VM to be started)
> >>>>>>>>   * it also depends on how many consumer instances you expect --
> >>>>>>>>     it's harder to start up 100 instances in 3 seconds than 5
> >>>>>>>>
> >>>>>>>> - the default value should be zero
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> One more thought: what about scaling scenarios? If a consumer group
> >>>>>>>> has 10 instances and should be scaled up to 20, it would make sense
> >>>>>>>> to do this with a single rebalance, too. Thus, I am wondering if it
> >>>>>>>> would make sense to apply this delay each time a new consumer joins
> >>>>>>>> the group, even if the group is not empty?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 3/23/17 10:19 AM, Damian Guy wrote:
> >>>>>>>>> Thanks Guozhang - I think another problem with this is that it is
> >>>>>>>>> overloading session.timeout.ms to mean multiple things. I'm not
> >>>>>>>>> sure that is a good thing.
> >>>>>>>>>
> >>>>>>>>> On Thu, 23 Mar 2017 at 17:14 Guozhang Wang <wangguoz@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> The downside of it, though, is that although it "hides" this from
> >>>>>>>>>> most of the users needing to be aware of it, by default the session
> >>>>>>>>>> timeout, i.e. the rebalance timeout, is 10 seconds, which could
> >>>>>>>>>> arguably be too long.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Guozhang
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Mar 23, 2017 at 10:12 AM, Guozhang Wang <wangguoz@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Just throwing another alternative idea here: we can consider
> >>>>>>>>>>> using the rebalance timeout value which is already included in
> >>>>>>>>>>> the join request protocol (and on the current Java client it is
> >>>>>>>>>>> always written as the session timeout value), so that the first
> >>>>>>>>>>> member joining will always force the coordinator to wait that
> >>>>>>>>>>> long. By doing this we do not need to bump up the protocol
> >>>>>>>>>>> either.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Mar 23, 2017 at 5:49 AM, Damian Guy <damian.guy@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Ismael,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Mostly to avoid the protocol bump.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I agree that it may be difficult to choose the right delay for
> >>>>>>>>>>>> all consumer groups, but we wanted to make this something that
> >>>>>>>>>>>> most users don't really need to think about, i.e., a small
> >>>>>>>>>>>> enough default delay that works in the majority of cases.
> >>>>>>>>>>>> However, it would be much more flexible as a consumer config,
> >>>>>>>>>>>> which I'm happy to pursue if this change is worthy of a
> >>>>>>>>>>>> protocol bump.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Damian
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, 23 Mar 2017 at 12:35 Ismael Juma <ismael@juma.me.uk>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the KIP, Damian. It makes sense to avoid multiple
> >>>>>>>>>>>>> rebalances during start-up. One issue with having this as a
> >>>>>>>>>>>>> broker config is that it may be difficult to choose the right
> >>>>>>>>>>>>> delay for all consumer groups. Can you elaborate a little more
> >>>>>>>>>>>>> on why the first alternative (add a consumer config) was
> >>>>>>>>>>>>> rejected? We bump protocol versions regularly (when it makes
> >>>>>>>>>>>>> sense), so it would be good to get a bit more detail.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>> Ismael
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Thu, Mar 23, 2017 at 12:24 PM, Damian Guy <damian.guy@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi All,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I've prepared a KIP to add a configurable delay to the
> >>>>>>>>>>>>>> initial consumer group rebalance.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Please have a look here:
> >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-134%3A+Delay+initial+consumer+group+rebalance
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Damian
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> BTW, I apologize if this appears twice. Seems the first one
> >>>>>>>>>>>>>> may not have made it.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> -- Guozhang
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> -- Guozhang
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>
> >>>
> >>
> >
>
>
