kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stanislav Kozlovski <stanis...@confluent.io>
Subject Re: [DISCUSS] KIP-394: Require member.id for initial join group request
Date Mon, 03 Dec 2018 12:57:24 GMT
Everything sounds good to me.

On Sun, Dec 2, 2018 at 1:24 PM Boyang Chen <bchen11@outlook.com> wrote:

> In fact, it's probably better to move KIP-394<
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request>
> to the vote stage first, so that it's easier to finalize the timeline and
> smooth the rollout plan for KIP-345. Jason and Stanislav, since you two
> involve most in this KIP, could you let me know if there is still any
> unclarity we want to resolve before moving to vote?
>
> Best,
> Boyang
> ________________________________
> From: Boyang Chen <bchen11@outlook.com>
> Sent: Saturday, December 1, 2018 10:53 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> request
>
> Thanks Jason for the reply! Since the overall motivation and design is
> pretty clear, I will go ahead to start implementation and we could discuss
> the underlying details in the PR.
>
> Best,
> Boyang
> ________________________________
> From: Matthias J. Sax <matthias@confluent.io>
> Sent: Saturday, December 1, 2018 3:12 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> request
>
> SGTM.
>
> On 11/30/18 10:17 AM, Jason Gustafson wrote:
> > Using the session expiration logic we already have seems like the
> simplest
> > option (this is probably a one or two line change). The rejoin should be
> > quick anyway, so I don't think it's worth optimizing for unjoined new
> > members. Just my two cents. This is more of an implementation detail, so
> > need not necessarily be resolved here.
> >
> > -Jason
> >
> > On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bchen11@outlook.com>
> wrote:
> >
> >> Thanks Matthias for the question. I'm thinking of having a separate hash
> >> set called `registeredMemberIds` which
> >> will be cleared out every time a group finishes one round of rebalance.
> >> Since storing one id is pretty trivial, using
> >> purgatory to track the id removal is a bit wasteful in my opinion.
> >> ________________________________
> >> From: Matthias J. Sax <matthias@confluent.io>
> >> Sent: Friday, November 30, 2018 10:26 AM
> >> To: dev@kafka.apache.org
> >> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> group
> >> request
> >>
> >> Thanks! Makes sense.
> >>
> >> I missed that fact, that the `member.id` is added on the second
> >> joinGroup request that contains the `member.id`.
> >>
> >> However, it seems there is another race condition for this design:
> >>
> >> If two consumers join at the same time, it it possible that the broker
> >> assigns the same `member.id` to both (because none of them have joined
> >> the group yet--ie, second joinGroup request not sent yet--, the
> >> `member.id` is not store broker side yes and broker cannot check for
> >> duplicates when creating a new `member.id`.
> >>
> >> The probability might be fairly low thought. However, what Stanislav
> >> proposed, to add the `member.id` directly, and remove it after
> >> `session.timeout.ms` sound like a save option that avoids this issue.
> >>
> >> Thoughts?
> >>
> >>
> >> -Matthias
> >>
> >> On 11/28/18 8:15 PM, Boyang Chen wrote:
> >>> Thanks Matthias for the question, and Stanislav for the explanation!
> >>>
> >>> For the scenario described, we will never let a member join the
> >> GroupMetadata map
> >>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> >>>
> >>>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
> >>>   2.  Broker rejects while allocating a member.id to c1 in response
> (c1
> >> protocol version is current);
> >>>   3.  c1 handles the error and rejoins with assigned member.id;
> >>>   4.  Broker stores c1 in its group metadata;
> >>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
> >>>   6.  Broker rejects while allocating a member.id to c2 in response
> (c2
> >> protocol version is current);
> >>>   7.  c2 fails to get the response/crashes in the middle;
> >>>   8.  After certain time, c2 restarts a join request with
> >> UNKNOWN_MEMBER_ID;
> >>>
> >>> As you could see, c2 will repeat step 6~8 until successfully send back
> a
> >> join group request with allocated id.
> >>> By then broker will include c2 within the broker metadata map.
> >>>
> >>> Does this sound clear to you?
> >>>
> >>> Best,
> >>> Boyang
> >>> ________________________________
> >>> From: Stanislav Kozlovski <stanislav@confluent.io>
> >>> Sent: Wednesday, November 28, 2018 7:39 PM
> >>> To: dev@kafka.apache.org
> >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> >> group request
> >>>
> >>> Hey Matthias,
> >>>
> >>> I think the notion is to have the `session.timeout.ms` to start
> ticking
> >>> when the broker responds with the member.id. Then, the broker would
> >>> properly expire consumers and not hold too many stale ones.
> >>> This isn't mentioned in the KIP though so it is worth to wait for
> Boyang
> >> to
> >>> confirm
> >>>
> >>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <matthias@confluent.io
> >
> >>> wrote:
> >>>
> >>>> Thanks for the KIP Boyang.
> >>>>
> >>>> I guess I am missing something, but I am still learning more details
> >>>> about the rebalance protocol, so maybe you can help me out?
> >>>>
> >>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup
> request.
> >>>> The broker generates a `member.id` and sends it back via
> >>>> `MEMBER_ID_REQUIRED` error response. This response might never reach
> the
> >>>> client or the client fails before it can send the second joinGroup
> >>>> request. Thus, a client would need to start over with a new
> >>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
> >>>> generate a new `member.id` again.
> >>>>
> >>>> So it seems the problem is moved, but not resolved? The motivation of
> >>>> the KIP is:
> >>>>
> >>>>> The edge case is that if initial join group request keeps failing
due
> >> to
> >>>> connection timeout, or the consumer keeps restarting,
> >>>>
> >>>> From my understanding, this KIP move the issue from the first to the
> >>>> second joinGroup request (or broker joinGroup response).
> >>>>
> >>>> But maybe I am missing something. Can you help me out?
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
> >>>>> Thanks Stanislav and Jason for the suggestions!
> >>>>>
> >>>>>
> >>>>>> Thanks for the KIP. Looks good overall. I think we will need
to bump
> >> the
> >>>>>> version of the JoinGroup protocol in order to indicate compatibility
> >>>> with
> >>>>>> the new behavior. The coordinator needs to know when it is safe
to
> >>>> assume
> >>>>>> the client will handle the error code.
> >>>>>>
> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> >> error
> >>>>>> code. When the client sees this error code, it will take the
> memberId
> >>>> from
> >>>>>> the response and rejoin. We'd still need the protocol bump since
> older
> >>>>>> consumers do not have this logic.
> >>>>>
> >>>>> I will add the join group protocol version change to the KIP.
> Meanwhile
> >>>> I feel for
> >>>>> understandability it's better to define a separate error code since
> >>>> REBALANCE_IN_PROGRESS
> >>>>> is not the actual cause of the returned error.
> >>>>>
> >>>>>> One small question I have is now that we have one and a half
> >> round-trips
> >>>>>> needed to join in a rebalance (1 full RT addition), is it worth
it
> to
> >>>>>> consider increasing the default value of `
> >>>> group.initial.rebalance.delay.ms`?
> >>>>> I guess we could keep it for now. After KIP-345 and incremental
> >>>> cooperative rebalancing
> >>>>> work we should be safe to deprecate `
> group.initial.rebalance.delay.ms
> >> `.
> >>>> Also one round trip
> >>>>> shouldn't increase the latency too much IMO.
> >>>>>
> >>>>> Best,
> >>>>> Boyang
> >>>>> ________________________________
> >>>>> From: Stanislav Kozlovski <stanislav@confluent.io>
> >>>>> Sent: Wednesday, November 28, 2018 2:32 AM
> >>>>> To: dev@kafka.apache.org
> >>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> >>>> group request
> >>>>>
> >>>>> Hi Boyang,
> >>>>>
> >>>>> The KIP looks very good.
> >>>>> One small question I have is now that we have one and a half
> >> round-trips
> >>>>> needed to join in a rebalance (1 full RT addition), is it worth
it to
> >>>>> consider increasing the default value of `
> >>>> group.initial.rebalance.delay.ms`?
> >>>>>
> >>>>> Best,
> >>>>> Stanislav
> >>>>>
> >>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <jason@confluent.io>
> >>>> wrote:
> >>>>>
> >>>>>> Hi Boyang,
> >>>>>>
> >>>>>> Thanks for the KIP. Looks good overall. I think we will need
to bump
> >> the
> >>>>>> version of the JoinGroup protocol in order to indicate compatibility
> >>>> with
> >>>>>> the new behavior. The coordinator needs to know when it is safe
to
> >>>> assume
> >>>>>> the client will handle the error code.
> >>>>>>
> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> >> error
> >>>>>> code. When the client sees this error code, it will take the
> memberId
> >>>> from
> >>>>>> the response and rejoin. We'd still need the protocol bump since
> older
> >>>>>> consumers do not have this logic.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jason
> >>>>>>
> >>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bchen11@outlook.com>
> >>>> wrote:
> >>>>>>
> >>>>>>> Hey friends,
> >>>>>>>
> >>>>>>>
> >>>>>>> I would like to start a discussion thread for KIP-394 which
is
> trying
> >>>> to
> >>>>>>> mitigate broker cache bursting issue due to anonymous join
group
> >>>>>> requests:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C3ca95629be9e42b1f00108d657383bfd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636792296362447479&amp;sdata=3BuPVUH5v3hMYe%2FMgpSsNftTwb5DsHDlm2lN%2FVUR0T8%3D&amp;reserved=0
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Boyang
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Best,
> >>>>> Stanislav
> >>>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> Best,
> >>> Stanislav
> >>>
> >>
> >>
> >
>
>

-- 
Best,
Stanislav

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message