kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Eno Thereska <eno.there...@gmail.com>
Subject Re: [DISCUSS] KIP-134: Delay initial consumer group rebalance
Date Fri, 24 Mar 2017 20:23:28 GMT
Thanks Damian,

This KIP deals with the initial phase only. What about the cases when several consumers leave
a group? Won't there be several expensive rebalances then as well? I'm wondering if it makes
sense for the delay to hold anytime the "set" of consumers in a group changes, be it addition
to the group or removal from group.

Thanks
Eno


> On 24 Mar 2017, at 20:04, Becket Qin <becket.qin@gmail.com> wrote:
> 
> Thanks for the KIP, Damian.
> 
> My two cents on this. It seems there are two things worth thinking here:
> 
> 1. Better rebalance timing. We will try to rebalance only when all the
> consumers in a group have joined. The challenge would be someone has to
> define what does ALL consumers mean, it could either be a time or number of
> consumers, etc.
> 
> 2. Avoid frequent rebalance. For example, if there are 100 consumers in a
> group, today, in the worst case, we may end up with 100 rebalances even if
> all the consumers joined the group in a reasonably small amount of time.
> Frequent rebalance is also a bad thing for brokers.
> 
> Having a client side configuration may solve problem 1 better because each
> consumer group can potentially configure their own timing. However, it does
> not really prevent frequent rebalance in general because some of the
> consumers can be misconfigured. (This may have something to do with KIP-124
> as well. But if quota is applied on the JoinGroup/SyncGroup request it may
> cause some unwanted cascading effects.)
> 
> Having a broker side configuration may result in less flexibility for each
> consumer group, but it can prevent frequent rebalance better. I think with
> some reasonable design, the rebalance timing issue can be resolved on the
> broker side as well. Matthias had a good point on extending the delay when
> a new consumer joins a group (we actually did something similar to batch
> ISR change propagation). For example, let's say on the broker side, we will
> always delay 2 seconds each time we see a new consumer joining a consumer
> group. This would probably work for most of the consumer groups and will
> also limit the rebalance frequency to protect the brokers.
> 
> I am not sure about the streams use case here, but if something like 2
> seconds of delay is acceptable for streams, I would prefer adding the
> configuration to the broker so that we can address both problems.
> 
> Thanks,
> 
> Jiangjie (Becket) Qin
> 
> 
> On Fri, Mar 24, 2017 at 5:30 AM, Damian Guy <damian.guy@gmail.com> wrote:
> 
>> Thanks for the feedback.
>> 
>> Ewen: I'm happy to make it a client side config. Other than the protocol
>> bump i think the effort is almost the same. Personally i see no other
>> issues, but based on discussions with others this is what we came up with.
>> 
>> True, it can probably be tested easily via an integration test.
>> 
>> Matthias: Yes i agree, the delay could be extended as each new member joins
>> the group.
>> 
>> Thanks,
>> Damian
>> 
>> On Fri, 24 Mar 2017 at 05:14 Ewen Cheslack-Postava <ewen@confluent.io>
>> wrote:
>> 
>>> I have the same initial response as Ismael re: broker vs consumer
>> settings.
>>> The global setting seems questionable.
>>> 
>>> Could we maybe summarize what the impact of making this a client config
>>> would be? Protocol bump is obvious, but is there any other significant
>>> issue? For the protocol bump in particular, I think this change is
>>> currently really critical for streams; it will be valuable elsewhere, but
>>> the immediate demand is streams, so a protocol bump while being backwards
>>> compatible wouldn't affect any other clients. Is this still actually
>>> compatible with different clients given that they would now expect
>>> different timeouts? (I think it's strictly compatible if you wait for
>>> responses, but if you enforce any client side timeouts, I'm not so sure.)
>>> 
>>> re: test plan, I'm sure this will come as a surprise, but is the system
>>> test even necessary? Validating # of rebalances seems messy as other
>> things
>>> can cause rebalances (though admittedly not in a "clean" case). But
>> really
>>> it seems like an integration test could validate this by making sure
>> only 1
>>> rebalance occurred when 2 members joined with a sufficient time gap.
>>> 
>>> -Ewen
>>> 
>>> On Thu, Mar 23, 2017 at 3:53 PM, Matthias J. Sax <matthias@confluent.io>
>>> wrote:
>>> 
>>>> Thanks for the KIP Damian!
>>>> 
>>>> My two cents:
>>>> 
>>>> - we should have an explicit parameter for this -- implicit setting
>> are
>>>> always tricky (the "importance" of this parameter would be LOW)
>>>> 
>>>> - the config should be different for each consumer group:
>>>>   * assume you have a stateless app, you want to rebalance immediately
>>>>   * if you start-up in an visualized environment using some tools like
>>>> Mesos you might need a different value that on bare metal (no VM to be
>>>> started)
>>>>   * it also depends, how many consumer instanced you expect -- it's
>>>> harder to start up 100 instances in 3 seconds than 5
>>>> 
>>>> - the default value should be zero
>>>> 
>>>> 
>>>> One more thought: what about scaling scenarios? If a consumer group has
>>>> 10 instanced and should be scaled up to 20, it would make sense to do
>>>> this with a single rebalance, too. Thus, I am wondering, if it would
>>>> make sense to apply this delay each time a new consumer joins group,
>>>> even if the group is not empty?
>>>> 
>>>> 
>>>> -Matthias
>>>> 
>>>> 
>>>> On 3/23/17 10:19 AM, Damian Guy wrote:
>>>>> Thanks Gouzhang - i think another problem with this is that is
>>>> overloading
>>>>> session.timeout.ms to mean multiple things. I'm not sure that is a
>>> good
>>>>> thing.
>>>>> 
>>>>> On Thu, 23 Mar 2017 at 17:14 Guozhang Wang <wangguoz@gmail.com>
>> wrote:
>>>>> 
>>>>>> The downside of it, though, is that although it "hides" this from
>> most
>>>> of
>>>>>> the users needing to be aware of it, by default session timeout i.e.
>>> the
>>>>>> rebalance timeout is 10 seconds which could arguably too long.
>>>>>> 
>>>>>> 
>>>>>> Guozhang
>>>>>> 
>>>>>> On Thu, Mar 23, 2017 at 10:12 AM, Guozhang Wang <wangguoz@gmail.com
>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Just throwing another alternative idea here: we can consider
using
>>> the
>>>>>>> rebalance timeout value which is already included in the join
>> request
>>>>>>> protocol (and on the current Java client it is always written
as
>> the
>>>>>>> session timeout value), that the first member joining will always
>>> force
>>>>>> the
>>>>>>> coordinator to wait that long. By doing this we do not need to
bump
>>> up
>>>>>> the
>>>>>>> protocol either.
>>>>>>> 
>>>>>>> 
>>>>>>> Guozhang
>>>>>>> 
>>>>>>> On Thu, Mar 23, 2017 at 5:49 AM, Damian Guy <damian.guy@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Ismael,
>>>>>>>> 
>>>>>>>> Mostly to avoid the protocol bump.
>>>>>>>> 
>>>>>>>> I agree that it may be difficult to choose the right delay
for all
>>>>>>>> consumer
>>>>>>>> groups, but we wanted to make this something that most users
don't
>>>>>> really
>>>>>>>> need to think about, i.e., a small enough default delay that
works
>>> in
>>>>>> the
>>>>>>>> majority of cases. However it would be much more flexible
as a
>>>> consumer
>>>>>>>> config, which i'm happy to pursue if this change is worthy
of a
>>>> protocol
>>>>>>>> bump.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Damian
>>>>>>>> 
>>>>>>>> On Thu, 23 Mar 2017 at 12:35 Ismael Juma <ismael@juma.me.uk>
>> wrote:
>>>>>>>> 
>>>>>>>>> Thanks for the KIP, Damian. It makes sense to avoid multiple
>>>>>> rebalances
>>>>>>>>> during start-up. One issue with having this as a broker
config is
>>>> that
>>>>>>>> it
>>>>>>>>> may be difficult to choose the right delay for all consumer
>> groups.
>>>>>> Can
>>>>>>>> you
>>>>>>>>> elaborate a little more on why the first alternative
(add a
>>> consumer
>>>>>>>>> config) was rejected? We bump protocol versions regularly
(when
>> it
>>>>>> makes
>>>>>>>>> sense), so it would be good to get a bit more detail.
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Ismael
>>>>>>>>> 
>>>>>>>>> On Thu, Mar 23, 2017 at 12:24 PM, Damian Guy <
>> damian.guy@gmail.com
>>>> 
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi All,
>>>>>>>>>> 
>>>>>>>>>> I've prepared a KIP to add a configurable delay to
the initial
>>>>>>>> consumer
>>>>>>>>>> group rebalance.
>>>>>>>>>> 
>>>>>>>>>> Please have look here:
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>>>> 134%3A+Delay+initial+consumer+group+rebalance
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Damian
>>>>>>>>>> 
>>>>>>>>>> BTW, i apologize if this appears twice. Seems the
first one may
>>> have
>>>>>>>> not
>>>>>>>>>> made it.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> -- Guozhang
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> -- Guozhang
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 


Mime
View raw message