From: Stanislav Kozlovski
Date: Mon, 3 Dec 2018 12:57:24 +0000
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
To: dev@kafka.apache.org

Everything sounds good to me.

On Sun, Dec 2, 2018 at 1:24 PM Boyang Chen wrote:

> In fact, it's probably better to move KIP-394
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request>
> to the vote stage first, so that it's easier to finalize the timeline and
> smooth the rollout plan for KIP-345. Jason and Stanislav, since you two
> have been the most involved in this KIP, could you let me know if there is
> still anything unclear that we want to resolve before moving to a vote?
>
> Best,
> Boyang
> ________________________________
> From: Boyang Chen
> Sent: Saturday, December 1, 2018 10:53 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
>
> Thanks Jason for the reply! Since the overall motivation and design are
> pretty clear, I will go ahead and start the implementation, and we can
> discuss the underlying details in the PR.
>
> Best,
> Boyang
> ________________________________
> From: Matthias J. Sax
> Sent: Saturday, December 1, 2018 3:12 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
>
> SGTM.
>
> On 11/30/18 10:17 AM, Jason Gustafson wrote:
> > Using the session expiration logic we already have seems like the
> > simplest option (this is probably a one or two line change). The rejoin
> > should be quick anyway, so I don't think it's worth optimizing for
> > unjoined new members. Just my two cents. This is more of an
> > implementation detail, so it need not necessarily be resolved here.
> >
> > -Jason
> >
> > On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen wrote:
> >
> >> Thanks Matthias for the question.
> >> I'm thinking of having a separate hash set called `registeredMemberIds`,
> >> which will be cleared out every time a group finishes one round of
> >> rebalance. Since storing one id is pretty trivial, using purgatory to
> >> track the id removal is a bit wasteful in my opinion.
> >> ________________________________
> >> From: Matthias J. Sax
> >> Sent: Friday, November 30, 2018 10:26 AM
> >> To: dev@kafka.apache.org
> >> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
> >>
> >> Thanks! Makes sense.
> >>
> >> I missed the fact that the `member.id` is added on the second
> >> joinGroup request that contains the `member.id`.
> >>
> >> However, it seems there is another race condition for this design:
> >>
> >> If two consumers join at the same time, it is possible that the broker
> >> assigns the same `member.id` to both (because neither of them has joined
> >> the group yet, i.e., the second joinGroup request has not been sent yet,
> >> the `member.id` is not stored broker-side yet and the broker cannot check
> >> for duplicates when creating a new `member.id`).
> >>
> >> The probability might be fairly low though. However, what Stanislav
> >> proposed, to add the `member.id` directly and remove it after
> >> `session.timeout.ms`, sounds like a safe option that avoids this issue.
> >>
> >> Thoughts?
> >>
> >>
> >> -Matthias
> >>
> >> On 11/28/18 8:15 PM, Boyang Chen wrote:
> >>> Thanks Matthias for the question, and Stanislav for the explanation!
> >>>
> >>> For the scenario described, we will never let a member join the
> >>> GroupMetadata map if it uses UNKNOWN_MEMBER_ID. So the workflow will be
> >>> like this:
> >>>
> >>>   1. Group is empty. Consumer c1 starts and joins with UNKNOWN_MEMBER_ID;
> >>>   2. Broker rejects the request while allocating a member.id to c1 in the
> >>>      response (c1 protocol version is current);
> >>>   3. c1 handles the error and rejoins with the assigned member.id;
> >>>   4. Broker stores c1 in its group metadata;
> >>>   5. Consumer c2 starts and joins with UNKNOWN_MEMBER_ID;
> >>>   6. Broker rejects the request while allocating a member.id to c2 in the
> >>>      response (c2 protocol version is current);
> >>>   7. c2 fails to get the response or crashes in the middle;
> >>>   8. After a certain time, c2 sends a new join request with
> >>>      UNKNOWN_MEMBER_ID;
> >>>
> >>> As you can see, c2 will repeat steps 6~8 until it successfully sends
> >>> back a join group request with the allocated id. By then the broker will
> >>> include c2 within the broker metadata map.
> >>>
> >>> Does this sound clear to you?
> >>>
> >>> Best,
> >>> Boyang
> >>> ________________________________
> >>> From: Stanislav Kozlovski
> >>> Sent: Wednesday, November 28, 2018 7:39 PM
> >>> To: dev@kafka.apache.org
> >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
> >>>
> >>> Hey Matthias,
> >>>
> >>> I think the notion is to have the `session.timeout.ms` start ticking
> >>> when the broker responds with the member.id. Then, the broker would
> >>> properly expire consumers and not hold too many stale ones.
> >>> This isn't mentioned in the KIP though, so it is worth waiting for
> >>> Boyang to confirm.
> >>>
> >>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax wrote:
> >>>
> >>>> Thanks for the KIP Boyang.
> >>>>
> >>>> I guess I am missing something, but I am still learning more details
> >>>> about the rebalance protocol, so maybe you can help me out?
> >>>>
> >>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
> >>>> The broker generates a `member.id` and sends it back via a
> >>>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
> >>>> client, or the client fails before it can send the second joinGroup
> >>>> request. Thus, a client would need to start over with a new
> >>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
> >>>> generate a new `member.id` again.
> >>>>
> >>>> So it seems the problem is moved, but not resolved? The motivation of
> >>>> the KIP is:
> >>>>
> >>>>> The edge case is that if initial join group request keeps failing due
> >>>>> to connection timeout, or the consumer keeps restarting,
> >>>>
> >>>> From my understanding, this KIP moves the issue from the first to the
> >>>> second joinGroup request (or broker joinGroup response).
> >>>>
> >>>> But maybe I am missing something. Can you help me out?
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
> >>>>> Thanks Stanislav and Jason for the suggestions!
> >>>>>
> >>>>>
> >>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> >>>>>> the version of the JoinGroup protocol in order to indicate
> >>>>>> compatibility with the new behavior. The coordinator needs to know
> >>>>>> when it is safe to assume the client will handle the error code.
> >>>>>>
> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> >>>>>> error code. When the client sees this error code, it will take the
> >>>>>> memberId from the response and rejoin. We'd still need the protocol
> >>>>>> bump since older consumers do not have this logic.
> >>>>>
> >>>>> I will add the join group protocol version change to the KIP. Meanwhile,
> >>>>> I feel that for understandability it's better to define a separate error
> >>>>> code, since REBALANCE_IN_PROGRESS is not the actual cause of the
> >>>>> returned error.
> >>>>>
> >>>>>> One small question I have is now that we have one and a half
> >>>>>> round-trips needed to join in a rebalance (1 full RT addition), is it
> >>>>>> worth it to consider increasing the default value of
> >>>>>> `group.initial.rebalance.delay.ms`?
> >>>>> I guess we could keep it for now.
> >>>>> After KIP-345 and incremental cooperative rebalancing work we should
> >>>>> be safe to deprecate `group.initial.rebalance.delay.ms`. Also one round
> >>>>> trip shouldn't increase the latency too much IMO.
> >>>>>
> >>>>> Best,
> >>>>> Boyang
> >>>>> ________________________________
> >>>>> From: Stanislav Kozlovski
> >>>>> Sent: Wednesday, November 28, 2018 2:32 AM
> >>>>> To: dev@kafka.apache.org
> >>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
> >>>>>
> >>>>> Hi Boyang,
> >>>>>
> >>>>> The KIP looks very good.
> >>>>> One small question I have is now that we have one and a half
> >>>>> round-trips needed to join in a rebalance (1 full RT addition), is it
> >>>>> worth it to consider increasing the default value of
> >>>>> `group.initial.rebalance.delay.ms`?
> >>>>>
> >>>>> Best,
> >>>>> Stanislav
> >>>>>
> >>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson wrote:
> >>>>>
> >>>>>> Hi Boyang,
> >>>>>>
> >>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> >>>>>> the version of the JoinGroup protocol in order to indicate
> >>>>>> compatibility with the new behavior. The coordinator needs to know
> >>>>>> when it is safe to assume the client will handle the error code.
> >>>>>>
> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> >>>>>> error code. When the client sees this error code, it will take the
> >>>>>> memberId from the response and rejoin. We'd still need the protocol
> >>>>>> bump since older consumers do not have this logic.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jason
> >>>>>>
> >>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen wrote:
> >>>>>>
> >>>>>>> Hey friends,
> >>>>>>>
> >>>>>>>
> >>>>>>> I would like to start a discussion thread for KIP-394 which is trying
> >>>>>>> to mitigate broker cache bursting issue due to anonymous join group
> >>>>>>> requests:
> >>>>>>>
> >>>>>>>
> >>>>>>> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&data=02%7C01%7C%7C3ca95629be9e42b1f00108d657383bfd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636792296362447479&sdata=3BuPVUH5v3hMYe%2FMgpSsNftTwb5DsHDlm2lN%2FVUR0T8%3D&reserved=0
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Boyang
> >>>>>>>
> >>>>>
> >>>>> --
> >>>>> Best,
> >>>>> Stanislav
> >>>>>
> >>>
> >>> --
> >>> Best,
> >>> Stanislav
> >>>

-- 
Best,
Stanislav
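
For readers following the thread, the join handshake described in steps 1 to 8 above, together with the suggestion to expire allocated-but-unjoined ids after `session.timeout.ms`, can be sketched roughly as follows. This is only an illustrative toy model in Python, not Kafka's coordinator code: the class `ToyCoordinator`, its `pending` and `members` fields, the `consumer-` id prefix, and the choice to answer an expired id with an UNKNOWN_MEMBER_ID error are all assumptions made for the example.

```python
import uuid

UNKNOWN_MEMBER_ID = ""                    # placeholder id sent on the very first join
MEMBER_ID_REQUIRED = "MEMBER_ID_REQUIRED"
ERR_UNKNOWN_MEMBER = "UNKNOWN_MEMBER_ID"  # assumed error for ids the broker does not know
NONE = "NONE"


class ToyCoordinator:
    """Hypothetical, simplified model of the two-step join discussed in the thread."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.members = set()   # ids that completed the second join
        self.pending = {}      # allocated id -> expiry deadline in ms

    def join_group(self, member_id, now_ms):
        """Return (error, member_id) for one joinGroup request."""
        self._expire_pending(now_ms)
        if member_id == UNKNOWN_MEMBER_ID:
            # First join: allocate an id, remember it as pending, and reject.
            new_id = self._new_member_id()
            self.pending[new_id] = now_ms + self.session_timeout_ms
            return MEMBER_ID_REQUIRED, new_id
        if member_id in self.pending:
            # Second join with the allocated id: store the member for real.
            del self.pending[member_id]
            self.members.add(member_id)
            return NONE, member_id
        if member_id in self.members:
            return NONE, member_id
        # Never allocated, or allocated but already expired.
        return ERR_UNKNOWN_MEMBER, member_id

    def _new_member_id(self):
        # Check pending ids too, so two concurrent first joins cannot collide.
        while True:
            candidate = "consumer-" + uuid.uuid4().hex
            if candidate not in self.pending and candidate not in self.members:
                return candidate

    def _expire_pending(self, now_ms):
        # Drop allocated ids whose owner never came back within the timeout.
        expired = [m for m, deadline in self.pending.items() if deadline <= now_ms]
        for m in expired:
            del self.pending[m]
```

In this sketch a consumer that crashes between the two joins (c2 in step 7) leaves only a small pending entry behind, which disappears after the timeout, so repeated anonymous joins do not grow the stored member set.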