From: Ron Dagostino
Date: Tue, 3 Sep 2019 16:50:50 -0400
Subject: Re: [DISCUSS] KIP-500: Replace ZooKeeper with a Self-Managed Metadata Quorum
To: dev@kafka.apache.org

Thanks, Colin. That all makes sense, especially the part about the onerous testing requirements associated with supporting both Zookeeper and the new metadata quorum simultaneously. Given that, I now buy into the idea that the transition to the new metadata quorum becomes the main path forward once the final bridge release is cut. There may be back-ports of bugs and features based on demand, as you suggest at the end of your reply, but there is no guarantee of that happening. I'm good with that now.

Ron

On Tue, Sep 3, 2019 at 1:05 PM Colin McCabe wrote:

> On Mon, Sep 2, 2019, at 07:51, Ron Dagostino wrote:
> > Hi Colin. It is not unusual for customers to wait before upgrading -- to avoid so-called "point-zero" releases -- so as to avoid as many of the inevitable bugs that ride along with new functionality as possible. Removal of Zookeeper is going to feel especially risky to these customers, and arguably it is going to feel risky even to customers who might otherwise be less sensitive to upgrade risk.
> >
> > This leads me to believe it is reasonable to expect that the uptake of the new ZK-less consensus quorum could be delayed in many installations -- that such customers might wait longer than usual to adopt the feature and abandon their Zookeeper servers.
> >
> > Will it be possible to use releases beyond the bridge release and not abandon Zookeeper? For example, what would happen if post-bridge the new consensus quorum servers are never started? Would Kafka still work fine? At what point MUST Zookeeper be abandoned? Taking the perspective of the above customers, I think they would prefer to have others adopt the new ZK-less consensus quorum for several months and encounter many of the inevitable bugs before adopting it themselves. But at the same time they will not want to be stuck on the bridge release that whole time, because there are going to be both bug fixes and new features that they will want to take advantage of.
> >
> > If the bridge release is the last one that supports Zookeeper, and if some customers stay on that release for a while, then I could see those customers wanting back-ports of bug fixes and features to occur for a period of time that extends beyond what is normally done.
> >
> > Basically, to sum all of the above up, I think there is a reasonable probability that a single bridge release only could become a potential barrier that causes angst for the project and the community.
> >
> > I wonder if it would be in the interest of the project and the community to mitigate the risk of there being a bridge release barrier by extending the time when ZK would still be supported -- perhaps for up to a year -- and the new consensus quorum could remain optional.
> >
> > Ron
>
> Hi Ron,
>
> Changing things always involves risk. This is why we are trying to do as much as we can incrementally. For example, removing ZK dependencies from tools, and from brokers. However, there are things that can't really be done incrementally, and one of these is switching over to a new metadata store.
>
> It might seem like supporting multiple options for where to store metadata would be safer somehow, but actually this is not the case. Having to support totally different code paths involves a lot more work and a lot more testing. We already have configurations that aren't tested enough. Doubling (at least) the number of configurations we have to test is a non-starter.
>
> This also ties in with the discussion in the KIP about why we don't plan on supporting pluggable consensus or pluggable metadata storage. Doing this would force us to use only the least-common-denominator features of every metadata storage. We would not be able to take advantage of metadata as a stream of events, or any of the features that ZK doesn't have. Configuration would also be quite complex.
>
> As the KIP states, the upgrade from a bridge release (there may be several bridge releases) to a ZK-less release will have no impact on clients. It also won't have any impact on cluster sizing (ZK nodes will simply become controller nodes). And it will be possible to do with a rolling upgrade. I agree that some people may be nervous about running the new software, and we may want to have more point releases of the older branches.
>
> This is something that we'll discuss when people propose release schedules. In general, this isn't fundamentally different than someone wanting a new release of 1.x because they don't want to upgrade to 2.x. If there's enough interest, we'll do it.
>
> best,
> Colin
>
> > On Aug 26, 2019, at 6:55 PM, Colin McCabe wrote:
> > > Hi Ryanne,
> > >
> > > Good point. I added a section titled "future work" with information about the follow-on KIPs that we discussed here.
> > >
> > > best,
> > > Colin
> > >
> > > On Fri, Aug 23, 2019, at 13:15, Ryanne Dolan wrote:
> > >> Thanks Colin, sgtm. Please make this clear in the KIP -- otherwise it is hard to nail down what we are voting for.
> > >>
> > >> Ryanne
> > >>
> > >> On Fri, Aug 23, 2019, 12:58 PM Colin McCabe wrote:
> > >>
> > >>> On Fri, Aug 23, 2019, at 06:24, Ryanne Dolan wrote:
> > >>>> Colin, can you outline what specifically would be in scope for this KIP vs deferred to the follow-on KIPs you've mentioned? Maybe a Future Work section? Is the idea to get to the bridge release with this KIP, and then go from there?
> > >>>>
> > >>>> Ryanne
> > >>>
> > >>> Hi Ryanne,
> > >>>
> > >>> The goal for KIP-500 is to set out an overall vision for how we will remove ZooKeeper and transition to managing metadata via a controller quorum.
> > >>>
> > >>> We will create follow-on KIPs that will lay out the specific details of each step.
> > >>>
> > >>> * A KIP for allowing kafka-configs.sh to change topic configurations without using ZooKeeper. (It can already change broker configurations without ZK.)
> > >>>
> > >>> * A KIP for adding APIs to replace direct ZK access by the brokers.
> > >>>
> > >>> * A KIP to describe Raft replication in Kafka, including the overall protocol, details of each RPC, etc.
> > >>>
> > >>> * A KIP describing the controller changes, how metadata is stored, etc.
> > >>>
> > >>> There may be other KIPs that we need (for example, if we find another tool that still has a hard ZK dependency), but that's the general idea. KIP-500 is about the overall design -- the follow-on KIPs are about the specific details.
> > >>>
> > >>> best,
> > >>> Colin
> > >>>
> > >>>> On Thu, Aug 22, 2019, 11:58 AM Colin McCabe wrote:
> > >>>>
> > >>>>> On Wed, Aug 21, 2019, at 19:48, Ron Dagostino wrote:
> > >>>>>> Thanks, Colin. The changes you made to the KIP related to the bridge release help make it clearer. I still have some confusion about the phrase "The rolling upgrade from the bridge release will take several steps." This made me think you are talking about moving from the bridge release to some other, newer release that comes after the bridge release. But I think what you are getting at is that the bridge release can be run with or without Zookeeper -- when first upgrading to it Zookeeper remains in use, but then there is a transition that can be made to engage the warp drive... I mean the Controller Quorum. So maybe the phrase should be "The rolling upgrade through the bridge release -- starting with Zookeeper being in use and ending with Zookeeper having been replaced by the Controller Quorum -- will take several steps."
> > >>>>>
> > >>>>> Hi Ron,
> > >>>>>
> > >>>>> To clarify, the bridge release will require ZooKeeper. It will also not support the controller quorum. It's a bridge in the sense that you must upgrade to a bridge release prior to upgrading to a ZK-less release. I added some more descriptive text to the bridge release paragraph -- hopefully this makes it clearer.
> > >>>>>
> > >>>>> best,
> > >>>>> Colin
> > >>>>>
> > >>>>>> Do I understand it correctly, and might some change in phrasing or additional clarification help others avoid the same confusion I had?
> > >>>>>>
> > >>>>>> Ron
> > >>>>>>
> > >>>>>> On Wed, Aug 21, 2019 at 2:31 PM Colin McCabe wrote:
> > >>>>>>
> > >>>>>>> On Wed, Aug 21, 2019, at 04:22, Ron Dagostino wrote:
> > >>>>>>>> Hi Colin. I like the concept of a "bridge release" for migrating off of Zookeeper, but I worry that it may become a bottleneck if people hesitate to replace Zookeeper -- they would be unable to adopt newer versions of Kafka until taking (what feels to them like) a giant leap.
> > >>>>>>>> As an example, assuming version 4.0.x of Kafka is the supported bridge release, I would not be surprised if uptake of the 4.x release and the time-based releases that follow it end up being much slower due to the perceived barrier.
> > >>>>>>>>
> > >>>>>>>> Any perceived barrier could be lowered if the 4.0.x release could optionally continue to use Zookeeper -- then the cutover would be two incremental steps (move to 4.0.x, then replace Zookeeper while staying on 4.0.x) as opposed to a single big bang (upgrade to 4.0.x and replace Zookeeper in one fell swoop).
> > >>>>>>>
> > >>>>>>> Hi Ron,
> > >>>>>>>
> > >>>>>>> Just to clarify, the "bridge release" will continue to use ZooKeeper. It will not support running without ZooKeeper. It is the releases that follow the bridge release that will remove ZooKeeper.
> > >>>>>>>
> > >>>>>>> Also, it's a bit unclear whether the bridge release would be 3.x or 4.x, or something to follow. We do know that the bridge release can't be a 2.x release, since it requires at least one incompatible change: removing the --zookeeper options from all the shell scripts. (Since we're doing semantic versioning, any time we make an incompatible change, we bump the major version number.)
> > >>>>>>>
> > >>>>>>> In general, using two sources of metadata is a lot more complex and error-prone than one. A lot of the bugs and corner cases we have are the result of divergences between the controller and the state in ZooKeeper. Eliminating this divergence, and the split-brain scenarios it creates, is a major goal of this work.
> > >>>>>>>
> > >>>>>>>> Regardless of whether what I wrote above has merit or not, I think the KIP should be more explicit about what the upgrade constraints actually are. Can the bridge release be adopted with Zookeeper remaining in place and then cutting over as a second, follow-on step, or must the Controller Quorum nodes be started first and the bridge release cannot be used with Zookeeper at all?
> > >>>>>>>
> > >>>>>>> As I mentioned above, the bridge release supports (indeed, requires) ZooKeeper. I have added a little more text about this to KIP-500 which hopefully makes it clearer.
> > >>>>>>>
> > >>>>>>> best,
> > >>>>>>> Colin
> > >>>>>>>
> > >>>>>>>> If the bridge release cannot be used with Zookeeper at all, then no version at or beyond the bridge release is available unless/until abandoning Zookeeper; if the bridge release can be used with Zookeeper, then is it the only version that can be used with Zookeeper, or can Zookeeper be kept for additional releases if desired?
> > >>>>>>>>
> > >>>>>>>> Ron
> > >>>>>>>>
> > >>>>>>>> On Tue, Aug 20, 2019 at 10:19 AM Ron Dagostino <rndgstn@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Colin.
> > >>>>>>>>> The diagram up at the top confused me -- specifically, the lines connecting the controller/active-controller to the brokers. I had assumed the arrows on those lines represented the direction of data flow, but that is not the case; the arrows actually identify the target of the action, and the non-arrowed end indicates the initiator of the action. For example, the lines point from the controller to the brokers in the "today" section on the left to show that the controller pushes to the brokers; the lines point from the brokers to the active-controller in the "tomorrow" section on the right to show that the brokers pull from the active-controller. As I said, this confused me because my gut instinct was to interpret the arrow as indicating the direction of data flow, and when I look at the "tomorrow" picture on the right I initially thought information was moving from the brokers to the active-controller. Did you consider drawing that picture with the arrows reversed on the "tomorrow" side so that the arrows represent the direction of data flow, and then adding the labels "push" on the "today" side and "pull" on the "tomorrow" side to indicate who initiates the data flow? It occurs to me that this picture may end up being widely distributed, so it might be in everyone's interest to proactively avoid any possible confusion by being more explicit.
> > >>>>>>>>>
> > >>>>>>>>> Minor corrections?
> > >>>>>>>>> <<<In the current world, a broker which can contact ZooKeeper but which is partitioned from the active controller
> > >>>>>>>>> >>>In the current world, a broker which can contact ZooKeeper but which is partitioned from the controller
> > >>>>>>>>>
> > >>>>>>>>> <<<Eventually, the controller will ask the broker to finally go offline
> > >>>>>>>>> >>>Eventually, the active controller will ask the broker to finally go offline
> > >>>>>>>>>
> > >>>>>>>>> <<<New versions of the clients should send these operations directly to the controller
> > >>>>>>>>> >>>New versions of the clients should send these operations directly to the active controller
> > >>>>>>>>>
> > >>>>>>>>> <<<In the post-ZK world, the leader will make an RPC to the controller instead
> > >>>>>>>>> >>>In the post-ZK world, the leader will make an RPC to the active controller instead
> > >>>>>>>>>
> > >>>>>>>>> <<<For example, the brokers may need to forward their requests to the controller.
> > >>>>>>>>> >>>For example, the brokers may need to forward their requests to the active controller.
> > >>>>>>>>>
> > >>>>>>>>> <<<The new controller will monitor ZooKeeper for legacy broker node registrations
> > >>>>>>>>> >>>The new (active) controller will monitor ZooKeeper for legacy broker node registrations
> > >>>>>>>>>
> > >>>>>>>>> Ron
> > >>>>>>>>>
> > >>>>>>>>> On Mon, Aug 19, 2019 at 6:53 PM Colin McCabe <cmccabe@apache.org> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi all,
> > >>>>>>>>>>
> > >>>>>>>>>> The KIP has been out for a while, so I'm thinking about calling a vote some time this week.
> > >>>>>>>>>>
> > >>>>>>>>>> best,
> > >>>>>>>>>> Colin
> > >>>>>>>>>>
> > >>>>>>>>>> On Mon, Aug 19, 2019, at 15:52, Colin McCabe wrote:
> > >>>>>>>>>>> On Mon, Aug 19, 2019, at 12:52, David Arthur wrote:
> > >>>>>>>>>>>> Thanks for the KIP, Colin. This looks great!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I really like the idea of separating the Controller and Broker JVMs.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> As you alluded to above, it might be nice to have a separate broker-registration API to avoid overloading the metadata fetch API.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi David,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for taking a look.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I removed the sentence about MetadataFetch also serving as the broker registration API. I think I agree that we will probably want a separate RPC to fill this role. We will have a follow-on KIP that will go into more detail about metadata propagation and registration in the post-ZK world. That KIP will also have a full description of the registration RPC, etc. For now, I think the important part for KIP-500 is that the broker registers with the controller quorum. On registration, the controller quorum assigns it a new broker epoch, which can distinguish successive broker incarnations.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> When a broker gets a metadata delta, will it be a sequence of deltas since the last update or a cumulative delta since the last update?
> > >>>>>>>>>>>
> > >>>>>>>>>>> It will be a sequence of deltas. Basically, the broker will be reading from the metadata log.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Will we include any kind of integrity check on the deltas to ensure the brokers have applied them correctly? Perhaps this will be addressed in one of the follow-on KIPs.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In general, we will have checksums on the metadata that we fetch. This is similar to how we have checksums on regular data. Or if the question is about catching logic errors in the metadata handling code, that sounds more like something that should be caught by test cases.
> > >>>>>>>>>>>
> > >>>>>>>>>>> best,
> > >>>>>>>>>>> Colin
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Thanks!
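To make the delta-fetching idea discussed above a bit more concrete, here is a minimal sketch in Java. All names here are hypothetical illustrations rather than the actual KIP-500 API, which is deferred to the follow-on KIPs: a broker-side cache applies a sequence of metadata deltas read from the metadata log and remembers the offset of the last record it applied, so the next MetadataFetch can resume from there.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch, not the real KIP-500 API: a broker-side metadata
    // cache that applies deltas in log order and tracks the offset of the
    // last delta it applied.
    public class MetadataImageSketch {
        // topic name -> partition count (a stand-in for richer per-topic metadata)
        private final Map<String, Integer> topics = new HashMap<>();
        private long lastAppliedOffset = -1L;

        // A single delta record read from the metadata log.
        public static final class Delta {
            final long offset;
            final String topic;
            final int partitions;   // ignored when deleted == true
            final boolean deleted;

            Delta(long offset, String topic, int partitions, boolean deleted) {
                this.offset = offset;
                this.topic = topic;
                this.partitions = partitions;
                this.deleted = deleted;
            }
        }

        // Apply deltas in log order, skipping anything already applied.
        public synchronized void apply(List<Delta> deltas) {
            for (Delta d : deltas) {
                if (d.offset <= lastAppliedOffset) {
                    continue; // duplicate or already-applied record
                }
                if (d.deleted) {
                    topics.remove(d.topic);
                } else {
                    topics.put(d.topic, d.partitions);
                }
                lastAppliedOffset = d.offset;
            }
        }

        // Offset to send in the next fetch, like a follower's fetch offset.
        public synchronized long nextFetchOffset() {
            return lastAppliedOffset + 1;
        }
    }

Reporting the next fetch offset back to the active controller is also what would let the controller decide, as discussed later in this thread, whether to keep returning deltas or to send a full metadata image to a broker that has fallen too far behind.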
> > >>>>>>>>>>>> On Fri, Aug 9, 2019 at 1:17 PM Colin McCabe <cmccabe@apache.org> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Mickael,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for taking a look.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I don't think we want to support that kind of multi-tenancy at the controller level. If the cluster is small enough that we want to pack the controller(s) with something else, we could run them alongside the brokers, or possibly inside three of the broker JVMs.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> best,
> > >>>>>>>>>>>>> Colin
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On Wed, Aug 7, 2019, at 10:37, Mickael Maison wrote:
> > >>>>>>>>>>>>>> Thanks Colin for kickstarting this initiative.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Just one question.
> > >>>>>>>>>>>>>> - A nice feature of Zookeeper is the ability to use chroots and have several Kafka clusters use the same Zookeeper ensemble. Is this something we should keep?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Thanks
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Mon, Aug 5, 2019 at 7:44 PM Colin McCabe <cmccabe@apache.org> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Mon, Aug 5, 2019, at 10:02, Tom Bentley wrote:
> > >>>>>>>>>>>>>>>> Hi Colin,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks for the KIP.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Currently ZooKeeper provides a convenient notification mechanism for knowing that broker and topic configuration has changed. While KIP-500 does suggest that incremental metadata update is expected to come to clients eventually, that would seem to imply that for some number of releases there would be no equivalent mechanism for knowing about config changes. Is there any thinking at this point about how a similar notification might be provided in the future?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> We could eventually have some inotify-like mechanism where clients could register interest in various types of events and get notified when they happen. Reading the metadata log is conceptually simple. The main complexity would be in setting up an API that made sense and that didn't unduly constrain future implementations. We'd have to think carefully about what the real use-cases for this were, though.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> best,
> > >>>>>>>>>>>>>>> Colin
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Tom
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Mon, Aug 5, 2019 at 3:49 PM Viktor Somogyi-Vass <viktorsomogyi@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hey Colin,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> I think this is a long-awaited KIP, thanks for driving it. I'm excited to eventually see this in Kafka. I collected my questions (and I accept the "TBD" answer, as they might be a bit deep for this high level :) ).
> > >>>>>>>>>>>>>>>>> 1.) Are there any specific reasons for the Controller persisting its state on disk only periodically instead of asynchronously with every update? Wouldn't less frequent saves increase the chance of missing a state change if the controller crashes between two saves?
> > >>>>>>>>>>>>>>>>> 2.) Why can't we allow brokers to fetch metadata from the follower controllers? I assume that followers would have up-to-date information, therefore brokers could fetch from there in theory.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>> Viktor
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Sun, Aug 4, 2019 at 6:58 AM Boyang Chen <reluctanthero104@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks for explaining, Ismael! Breaking down into follow-up KIPs sounds like a good idea.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Sat, Aug 3, 2019 at 10:14 AM Ismael Juma <ismael@juma.me.uk> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Hi Boyang,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Yes, there will be several KIPs that will discuss the items you describe in detail. Colin, it may be helpful to make this clear in the KIP-500 description.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Ismael
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Sat, Aug 3, 2019 at 9:32 AM Boyang Chen <reluctanthero104@gmail.com> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks Colin for initiating this important effort!
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> One question I have is whether we have a session discussing the controller failover in the new architecture?
> > >>>>>>>>>>>>>>>>>>>> I know we are using the Raft protocol to fail over, yet it's still valuable to discuss the steps the new cluster is going to take to reach the stable state again, so that we could easily measure the availability of the metadata servers.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Another suggestion I have is to write a step-by-step design doc like what we did in KIP-98 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging>, including the new request protocols and how they interact in the new cluster. For a complicated change like this, an implementation design doc helps a lot in the review process; otherwise most discussions we have will focus on the high level and lose important details as we discover them in the post-agreement phase.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Boyang
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> On Fri, Aug 2, 2019 at 5:17 PM Colin McCabe <cmccabe@apache.org> wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> On Fri, Aug 2, 2019, at 16:33, Jose Armando Garcia Sancio wrote:
> > >>>>>>>>>>>>>>>>>>>>>> Thanks Colin for the detailed KIP. I have a few comments and questions.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In the KIP's Motivation and Overview you mentioned the LeaderAndIsr and UpdateMetadata RPCs. For example, "updates which the controller pushes, such as LeaderAndIsr and UpdateMetadata messages". Is your thinking that we will use MetadataFetch as a replacement for just UpdateMetadata only and add topic configuration in this state?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Hi Jose,
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Thanks for taking a look.
> > >>>>>>>>>>>>>>>>>>>>> The goal is for MetadataFetchRequest to replace both LeaderAndIsrRequest and UpdateMetadataRequest. Topic configurations would be fetched along with the other metadata.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In the section "Broker Metadata Management", you mention "Just like with a fetch request, the broker will track the offset of the last updates it fetched". To keep the log consistent, Raft requires that the followers keep all of the log entries (term/epoch and offset) that are after the highwatermark. Any log entry before the highwatermark can be compacted/snapshotted. Do we expect the MetadataFetch API to only return log entries up to the highwatermark, unlike the Raft replication API, which will replicate/fetch log entries after the highwatermark for consensus?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Good question. Clearly, we shouldn't expose metadata updates to the brokers until they've been stored on a majority of the Raft nodes. The most obvious way to do that, like you mentioned, is to have the brokers only fetch up to the HWM, but not beyond. There might be a more clever way to do it by fetching the data, but not having the brokers act on it until the HWM advances. I'm not sure if that's worth it or not. We'll discuss this more in a separate KIP that discusses just Raft.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In section "Broker Metadata Management", you mention "the controller will send a full metadata image rather than a series of deltas". This KIP doesn't go into the set of operations that need to be supported on top of Raft, but it would be interesting if this "full metadata image" could also be expressed as deltas. For example, assuming we are replicating a map, this "full metadata image" could be a sequence of "put" operations (znode create, to borrow ZK semantics).
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> The full image can definitely be expressed as a sum of deltas. At some point, the number of deltas will get large enough that sending a full image is better, though. One question that we're still thinking about is how much of this can be shared with generic Kafka log code, and how much should be different.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In section "Broker Metadata Management", you mention "This request will double as a heartbeat, letting the controller know that the broker is alive". In section "Broker State Machine", you mention "The MetadataFetch API serves as this registration mechanism". Does this mean that the MetadataFetch request will optionally include broker configuration information?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I was originally thinking that the MetadataFetchRequest should include broker configuration information. Thinking about this more, maybe we should just have a special registration RPC that contains that information, to avoid sending it over the wire all the time.
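As a rough illustration of that trade-off only -- the field names below are hypothetical, not taken from the KIP or any follow-on KIP -- the static details would travel once in a registration request, while the frequent MetadataFetch that doubles as a heartbeat stays small:

    import java.util.Map;

    // Hypothetical message shapes, sketched to contrast a one-time registration
    // (carrying static broker details) with the lightweight periodic fetch that
    // doubles as a heartbeat.
    public final class RegistrationSketch {

        // Sent once when the broker (re)starts; the controller quorum answers
        // with a new broker epoch for this incarnation.
        public static final class BrokerRegistrationRequest {
            final int brokerId;
            final String listeners;                 // e.g. "PLAINTEXT://host:9092"
            final Map<String, String> staticConfig; // rack, log dirs, features, ...

            BrokerRegistrationRequest(int brokerId, String listeners,
                                      Map<String, String> staticConfig) {
                this.brokerId = brokerId;
                this.listeners = listeners;
                this.staticConfig = staticConfig;
            }
        }

        public static final class BrokerRegistrationResponse {
            final long brokerEpoch; // distinguishes successive broker incarnations

            BrokerRegistrationResponse(long brokerEpoch) {
                this.brokerEpoch = brokerEpoch;
            }
        }

        // Sent repeatedly; small, because the static details were registered once.
        public static final class MetadataFetchRequest {
            final int brokerId;
            final long brokerEpoch;
            final long fetchOffset;

            MetadataFetchRequest(int brokerId, long brokerEpoch, long fetchOffset) {
                this.brokerId = brokerId;
                this.brokerEpoch = brokerEpoch;
                this.fetchOffset = fetchOffset;
            }
        }

        private RegistrationSketch() {}
    }

If the controller stops seeing these periodic fetches from a given broker epoch for long enough, it could treat that incarnation as gone, which is the sense in which the request "doubles as a heartbeat" in the passage quoted above.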
> > >>>>>>>>>>>>>>>>>>>>>> Does this also mean that the MetadataFetch request will result in a "write"/AppendEntries through the Raft replication protocol before you can send the associated MetadataFetch response?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> I think we should require the broker to be out of the Offline state before allowing it to fetch metadata, yes. So the separate registration RPC should have completed first.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In section "Broker State", you mention that a broker can transition to online after it is caught up with the metadata. What do you mean by this? Metadata is always changing. How does the broker know that it is caught up, since it doesn't participate in the consensus or the advancement of the highwatermark?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> That's a good point. Being "caught up" is somewhat of a fuzzy concept here, since the brokers do not participate in the metadata consensus. I think ideally we would want to define it in terms of time ("the broker has all the updates from the last 2 minutes", for example). We should spell this out better in the KIP.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> In section "Start the controller quorum nodes", you mention "Once it has taken over the /controller node, the active controller will proceed to load the full state of ZooKeeper. It will write out this information to the quorum's metadata storage. After this point, the metadata quorum will be the metadata store of record, rather than the data in ZooKeeper." During this migration, should we expect to have a small period of controller unavailability while the controller replicates this state to all of the Raft nodes in the controller quorum and we buffer new controller API requests?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Yes, the controller would be unavailable during this time. I don't think this will be that different from the current period of unavailability when a new controller starts up and needs to load the full state from ZK. The main difference is that in this period, we'd have to write to the controller quorum rather than just to memory. But we believe this should be pretty fast.
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> regards,
> > >>>>>>>>>>>>>>>>>>>>> Colin
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> Thanks!
> > >>>>>>>>>>>>>>>>>>>>>> -Jose
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> --
> > >>>>>>>>>>>> David Arthur