From dev-return-92929-archive-asf-public=cust-asf.ponee.io@kafka.apache.org Tue Apr 3 17:39:44 2018 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id 50CF418064D for ; Tue, 3 Apr 2018 17:39:41 +0200 (CEST) Received: (qmail 78384 invoked by uid 500); 3 Apr 2018 15:39:35 -0000 Mailing-List: contact dev-help@kafka.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@kafka.apache.org Delivered-To: mailing list dev@kafka.apache.org Received: (qmail 78369 invoked by uid 99); 3 Apr 2018 15:39:34 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 03 Apr 2018 15:39:34 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id BE42AC61CD for ; Tue, 3 Apr 2018 15:39:33 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.129 X-Spam-Level: ** X-Spam-Status: No, score=2.129 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, FREEMAIL_ENVFROM_END_DIGIT=0.25, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd1-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id idO7JUB1IhT8 for ; Tue, 3 Apr 2018 15:39:23 +0000 (UTC) Received: from mail-ot0-f176.google.com (mail-ot0-f176.google.com [74.125.82.176]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 49C085F19C for ; Tue, 3 Apr 2018 15:39:22 +0000 (UTC) Received: by mail-ot0-f176.google.com with SMTP id p33-v6so13576710otp.11 for ; Tue, 03 Apr 2018 08:39:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:in-reply-to:references:from:date:message-id:subject:to; bh=yb2F53AUifmmg7AVZ/5wok4zF1YGYBlsuU9sMdA5uJU=; b=q3zlV+tSDkYvrM+pu7yJ0YnID+nH5H6shM/iixM9W1kqDSvZrHcYlQ/UbiyPZlAuj6 6YwyIxEB5gQVTBKkXHNJe2MkZaw01jmvmBNRdZbkcs620bNxr7Zk/oazxpITw5GM1tyQ 2F7uzckFLzeHWoxgs8cnKY/2M00b+d8c17GGfxMfuSu0PD6xGUgtwoOQNDTiony+wyHu q18Zqjye+iybFRy4oK/V9ryFk1MVQODBGZtp0PXZ0aFXLMFB6KoTZ7DKEhacCY6f0hIq NcDrj0TK1ZORRWBErNJJLPTQCnFemsdpxwsmZKiyiaf8OnSZ/A36N0f8M49FLl7Gg7ll +bRw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to; bh=yb2F53AUifmmg7AVZ/5wok4zF1YGYBlsuU9sMdA5uJU=; b=lhlrgZtmH8IgM5sdP31OWa5tePYNrWH+T9McXbw2BV/sA5mIwZdiiCd3Mfg0a6l7UX 0u6zMYlxM9o1oTJTR7G6lqu95zV/QM5x4uSd1kG9LTFXHOrepcGveNOWhHWEZGy81wUi tCtKM1G7XZrMGe32HSQLcNs2WgE5OEghSqFIZu+kplmyQhEmdO2k4kNQTEkietPwupUa u6G0g1BlLvRDlEcxY8/cALbb+c7IeNS+pVgMQTgNFCZwGnwMcnA0jMm6BHfOD88Y0T3K SWwTOBUcWOmsOrBQ9jv7santANXwRenvq6mS4nVI3ZKrC9OFW3rkS21O2T4rrAiMw7Lw NcGA== X-Gm-Message-State: ALQs6tCa/j9cUm3clC8KGHsox6Nv9FtnR298KYVbzaLW883nMlyvPQR5 7v+S591/Zw0L2Q4O7Lzdl63dDHNP0tSfojQqf2s= X-Google-Smtp-Source: AIpwx4/GeO3h+Qsyk9TKi+8Z2/laDkSitCd1LFPrrX7O6OyvHSqXuYVn34vzApH0dvy/TvrcMsWejce7ImY1xUAXPY4= X-Received: by 2002:a9d:2b54:: with SMTP id f20-v6mr7789654otd.277.1522769953772; Tue, 03 Apr 2018 08:39:13 -0700 (PDT) MIME-Version: 1.0 Received: by 10.201.55.66 with HTTP; Tue, 3 Apr 2018 08:38:53 -0700 (PDT) In-Reply-To: References: <5A93C9E0.4060908@trivago.com> <5A97A3EB.9050404@trivago.com> <5A9D3D69.9010808@trivago.com> <5A9F87C4.6020704@trivago.com> <5AA3A803.2040302@trivago.com> <5AA76DEB.8040601@trivago.com> <5AAA28F5.5060108@trivago.com> <5AB264DB.1080507@trivago.com> From: Dong Lin Date: Tue, 3 Apr 2018 08:38:53 -0700 Message-ID: Subject: Re: [DISCUSS] KIP-253: Support in-order message delivery with partition expansion To: dev@kafka.apache.org Content-Type: multipart/alternative; boundary="0000000000002b61730568f3827b" --0000000000002b61730568f3827b Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hey John, Thanks much for your comments!! I have yet to go through the emails of John/Jun/Guozhang in detail. But let me present my idea for how to minimize the delay for state loading for stream use-case. For ease of understanding, let's assume that the initial partition number of input topics and change log topic are both 10. And initial number of stream processor is also 10. If we only increase initial partition number of input topics to 15 without changing number of stream processor, the current KIP already guarantees in-order delivery and no state needs to be moved between consumers for stream use-case. Next, let's say we want to increase the number of processor to expand the processing capacity for stream use-case. This requires us to move state between processors which will take time. Our goal is to minimize the impact (i.e. delay) for processing while we increase the number of processors. Note that stream processor generally includes both consumer and producer. In addition to consume from the input topic, consumer may also need to consume from change log topic on startup for recovery. And producer may produce state to the change log topic. The solution will include the following steps: 1) Increase partition number of the input topic from 10 to 15. Since the messages with the same key will still go to the same consume before and after the partition expansion, this step can be done without having to move state between processors. 2) Increase partition number of the change log topic from 10 to 15. Note that this step can also be done without impacting existing workflow. After we increase partition number of the change log topic, key space may split and some key will be produced to the newly-added partition. But the same key will still go to the same processor (i.e. consumer) before and after the partition. Thus this step can also be done without having to move state between processors. 3) Now, let's add 5 new consumers whose groupId is different from the existing processor's groupId. Thus these new consumers will not impact existing workflow. Each of these new consumers should consume two partitions from the earliest offset, where these two partitions are the same partitions that will be consumed if the consumers have the same groupId as the existing processor's groupId. For example, the first of the five consumers will consume partition 0 and partition 10. The purpose of these consumers is to rebuild the state (e.g. RocksDB) for the processors in advance. Also note that, by design of the current KIP, each consume will consume the existing partition of the change log topic up to the offset before the partition expansion. Then they will only need to consume the state of the new partition of the change log topic. 4) After consumers have caught up in step 3), we should stop these consumers and add 5 new processors to the stream processing job. These 5 new processors should run in the same location as the previous 5 consumers to re-use the state (e.g. RocksDB). And these processors' consumers should consume partitions of the change log topic from the committed offset the previous 5 consumers so that no state is missed. One important trick to note here is that, the mapping from partition to consumer should also use linear hashing. And we need to remember the initial number of processors in the job, 10 in this example, and use this number in the linear hashing algorithm. This is pretty much the same as how we use linear hashing to map key to partition. In this case, we get an identity map from partition -> processor, for both input topic and the change log topic. For example, processor 12 will consume partition 12 of the input topic and produce state to the partition 12 of the change log topic. There are a few important properties of this solution to note: - We can increase the number of partitions for input topic and the change log topic in any order asynchronously. - The expansion of the processors in a given job in step 4) only requires the step 3) for the same job. It does not require coordination across different jobs for step 3) and 4). Thus different jobs can independently expand there capacity without waiting for each other. - The logic for 1) and 2) is already supported in the current KIP. The logic for 3) and 4) appears to be independent of the core Kafka logic and can be implemented separately outside core Kafka. Thus the current KIP is probably sufficient after we agree on the efficiency and the correctness of the solution. We can have a separate KIP for Kafka Stream to support 3) and 4). Cheers, Dong On Mon, Apr 2, 2018 at 3:25 PM, Guozhang Wang wrote: > Hey guys, just sharing my two cents here (I promise it will be shorter th= an > John's article :). > > 0. Just to quickly recap, the main discussion point now is how to support > "key partitioning preservation" (John's #4 in topic characteristics above= ) > beyond the "single-key ordering preservation" that KIP-253 was originally > proposed to maintain (John's #6 above). > > 1. From the streams project, we are actively working on improving the > elastic scalability of the library. One of the key features is to decoupl= e > the input topics from the parallelism model of Streams: i.e. not enforcin= g > the topic to be partitioned by the key, not enforcing joining topics to b= e > co-partitioned, not relying the number of parallel tasks on the input top= ic > partitions. This can be achieved by re-shuffling on the input topics to > make sure key-partitioning / co-partitioning on the internal topics. Note > the re-shuffling task is purely stateless and hence does not require "key > partitioning preservation". Operational-wise it is similar to the "creati= ng > a new topic with new number of partitions, pipe the data to the new topic > and cut over consumers from old topics" idea, just that users can > optionally let Streams to handle such rather than doing it manually > themselves. There are a few more details on that regard but I will skip > since they are not directly related to this discussion. > > 2. Assuming that 1) above is done, then the only topics involved in the > scaling events are all input topics. For these topics the only producers = / > consumers of these topics are controlled by Streams clients themselves, a= nd > hence achieving "key partitioning preservation" is simpler than non-Strea= ms > scenarios: consumers know the partitioning scheme that producers are usin= g, > so that for their stateful operations it is doable to split the local sta= te > stores accordingly or execute backfilling on its own. Of course, if we > decide to do server-side backfilling, it can still help Streams to direct= ly > rely on that functionality. > > 3. As John mentioned, another way inside Streams is to do over-partitioni= ng > on all internal topics; then with 1) Streams would not rely on KIP-253 at > all. But personally I'd like to avoid it if possible to reduce Kafka side > footprint: say we overpartition each input topic up to 1k, with a > reasonable sized stateful topology it can still contribute to tens of > thousands of topics to the topic partition capacity of a single cluster. > > 4. Summing up 1/2/3, I think we should focus more on non-Streams users > writing their stateful computations with local states, and think whether = / > how we could enable "key partitioning preservation" for them easily, than > to think heavily for Streams library. People may have different opinion o= n > how common of a usage pattern it is (I think Jun might be suggesting that > for DIY apps people may more likely use remote states so that it is not a > problem for them). My opinion is that for non-Streams users such usage > pattern could still be large (think: if you are piping data from Kafka to > an external data storage which has single-writer requirements for each > single shard, even though it is not a stateful computational application = it > may still require "key partitioning preservation"), so I prefer to have > backfilling in our KIP than only exposing the API for expansion and > requires consumers to have pre-knowledge of the producer's partitioning > scheme. > > > > Guozhang > > > > On Thu, Mar 29, 2018 at 2:33 PM, John Roesler wrote: > > > Hey Dong, > > > > Congrats on becoming a committer!!! > > > > Since I just sent a novel-length email, I'll try and keep this one brie= f > ;) > > > > Regarding producer coordination, I'll grant that in that case, producer= s > > may coordinate among themselves to produce into the same topic or to > > produce co-partitioned topics. Nothing in KStreams or the Kafka ecosyst= em > > in general requires such coordination for correctness or in fact for an= y > > optional features, though, so I would not say that we require producer > > coordination of partition logic. If producers currently coordinate, it'= s > > completely optional and their own choice. > > > > Regarding the portability of partition algorithms, my observation is th= at > > systems requiring independent implementations of the same algorithm wit= h > > 100% correctness are a large source of risk and also a burden on those > who > > have to maintain them. If people could flawlessly implement algorithms = in > > actual software, the world would be a wonderful place indeed! For a > system > > as important and widespread as Kafka, I would recommend restricting > > limiting such requirements as aggressively as possible. > > > > I'd agree that we can always revisit decisions like allowing arbitrary > > partition functions, but of course, we shouldn't do that in a vacuum. > That > > feels like the kind of thing we'd need to proactively seek guidance fro= m > > the users list about. I do think that the general approach of saying th= at > > "if you use a custom partitioner, you cannot do partition expansion" is > > very reasonable (but I don't think we need to go that far with the > current > > proposal). It's similar to my statement in my email to Jun that in > > principle KStreams doesn't *need* backfill, we only need it if we want = to > > employ partition expansion. > > > > I reckon that the main motivation for backfill is to support KStreams u= se > > cases and also any other use cases involving stateful consumers. > > > > Thanks for your response, and congrats again! > > -John > > > > > > On Wed, Mar 28, 2018 at 1:34 AM, Dong Lin wrote: > > > > > Hey John, > > > > > > Great! Thanks for all the comment. It seems that we agree that the > > current > > > KIP is in good shape for core Kafka. IMO, what we have been discussin= g > in > > > the recent email exchanges is mostly about the second step, i.e. how = to > > > address problem for the stream use-case (or stateful processing in > > > general). > > > > > > I will comment inline. > > > > > > > > > > > > > > > On Tue, Mar 27, 2018 at 4:38 PM, John Roesler > wrote: > > > > > > > Thanks for the response, Dong. > > > > > > > > Here are my answers to your questions: > > > > > > > > - "Asking producers and consumers, or even two different producers, > to > > > > > share code like the partition function is a pretty huge ask. What > if > > > they > > > > > are using different languages?". It seems that today we already > > require > > > > > different producer's to use the same hash function -- otherwise > > > messages > > > > > with the same key will go to different partitions of the same top= ic > > > which > > > > > may cause problem for downstream consumption. So not sure if it > adds > > > any > > > > > more constraint by assuming consumers know the hash function of > > > producer. > > > > > Could you explain more why user would want to use a cusmtom > partition > > > > > function? Maybe we can check if this is something that can be > > supported > > > > in > > > > > the default Kafka hash function. Also, can you explain more why i= t > is > > > > > difficuilt to implement the same hash function in different > > languages? > > > > > > > > > > > > Sorry, I meant two different producers as in producers to two > different > > > > topics. This was in response to the suggestion that we already > require > > > > coordination among producers to different topics in order to achiev= e > > > > co-partitioning. I was saying that we do not (and should not). > > > > > > > > > It is probably common for producers of different team to produce > message > > to > > > the same topic. In order to ensure that messages with the same key go > to > > > same partition, we need producers of different team to share the same > > > partition algorithm, which by definition requires coordination among > > > producers of different teams in an organization. Even for producers o= f > > > different topics, it may be common to require producers to use the sa= me > > > partition algorithm in order to join two topics for stream processing= . > > Does > > > this make it reasonable to say we already require coordination across > > > producers? > > > > > > > > > > By design, consumers are currently ignorant of the partitioning > scheme. > > > It > > > > suffices to trust that the producer has partitioned the topic by ke= y, > > if > > > > they claim to have done so. If you don't trust that, or even if you > > just > > > > need some other partitioning scheme, then you must re-partition it > > > > yourself. Nothing we're discussing can or should change that. The > value > > > of > > > > backfill is that it preserves the ability for consumers to avoid > > > > re-partitioning before consuming, in the case where they don't need > to > > > > today. > > > > > > > > > > Regarding shared "hash functions", note that it's a bit inaccurate = to > > > talk > > > > about the "hash function" of the producer. Properly speaking, the > > > producer > > > > has only a "partition function". We do not know that it is a hash. > The > > > > producer can use any method at their disposal to assign a partition > to > > a > > > > record. The partition function obviously may we written in any > > > programming > > > > language, so in general it's not something that can be shared aroun= d > > > > without a formal spec or the ability to execute arbitrary executabl= es > > in > > > > arbitrary runtime environments. > > > > > > > > > > Yeah it is probably better to say partition algorithm. I guess it > should > > > not be difficult to implement same partition algorithms in different > > > languages, right? Yes we would need a formal specification of the > default > > > partition algorithm in the producer. I think that can be documented a= s > > part > > > of the producer interface. > > > > > > > > > > > > > > Why would a producer want a custom partition function? I don't > know... > > > why > > > > did we design the interface so that our users can provide one? In > > > general, > > > > such systems provide custom partitioners because some data sets may > be > > > > unbalanced under the default or because they can provide some > > interesting > > > > functionality built on top of the partitioning scheme, etc. Having > > > provided > > > > this ability, I don't know why we would remove it. > > > > > > > > > > Yeah it is reasonable to assume that there was reason to support cust= om > > > partition function in producer. On the other hand it may also be > > reasonable > > > to revisit this interface and discuss whether we actually need to > support > > > custom partition function. If we don't have a good reason, we can > choose > > > not to support custom partition function in this KIP in a backward > > > compatible manner, i.e. user can still use custom partition function > but > > > they would not get the benefit of in-order delivery when there is > > partition > > > expansion. What do you think? > > > > > > > > > > > > > > - Besides the assumption that consumer needs to share the hash > function > > > of > > > > > producer, is there other organization overhead of the proposal in > the > > > > > current KIP? > > > > > > > > > > > > > It wasn't clear to me that KIP-253 currently required the producer > and > > > > consumer to share the partition function, or in fact that it had a > hard > > > > requirement to abandon the general partition function and use a > linear > > > hash > > > > function instead. > > > > > > > > > > In my reading, there is a requirement to track the metadata about > what > > > > partitions split into what other partitions during an expansion > > > operation. > > > > If the partition function is linear, this is easy. If not, you can > > always > > > > just record that all old partitions split into all new partitions. > This > > > has > > > > the effect of forcing all consumers to wait until the old epoch is > > > > completely consumed before starting on the new epoch. But this may > be a > > > > reasonable tradeoff, and it doesn't otherwise alter your design. > > > > > > > > You only mention the consumer needing to know that the partition > > function > > > > is linear, not what the actual function is, so I don't think your > > design > > > > actually calls for sharing the function. Plus, really all the > consumer > > > > needs is the metadata about what old-epoch partitions to wait for > > before > > > > consuming a new-epoch partition. This information is directly > captured > > in > > > > metadata, so I don't think it actually even cares whether the > partition > > > > function is linear or not. > > > > > > > > > > You are right that the current KIP does not mention it. My comment > > related > > > to the partition function coordination was related to support the > > > stream-use case which we have been discussing so far. > > > > > > > > > > So, no, I really think KIP-253 is in good shape. I was really more > > > talking > > > > about the part of this thread that's outside of KIP-253's scope, > > namely, > > > > creating the possibility of backfilling partitions after expansion. > > > > > > > > > > Great! Can you also confirm that the main motivation for backfilling > > > partitions after expansion is to support the stream use-case? > > > > > > > > > > - Currently producer can forget about the message that has been > > > > > acknowledged by the broker. Thus the producer probably does not > know > > > most > > > > > of the exiting messages in topic, including those messages produc= ed > > by > > > > > other producers. We can have the owner of the producer to > > > split+backfill. > > > > > In my opion it will be a new program that wraps around the existi= ng > > > > > producer and consumer classes. > > > > > > > > > > > > > This sounds fine by me! > > > > > > > > Really, I was just emphasizing that the part of the organization th= at > > > > produces a topic shouldn't have to export their partition function = to > > the > > > > part(s) of the organization (or other organizations) that consume t= he > > > > topic. Whether the backfill operation goes into the Producer > interface > > is > > > > secondary, I think. > > > > > > > > - Regarding point 5. The argument is in favor of the split+backfill > but > > > for > > > > > changelog topic. And it intends to address the problem for stream > > > > use-case > > > > > in general. In this KIP we will provide interface (i.e. > > > > > PartitionKeyRebalanceListener in the KIP) to be used by sream > > use-case > > > > and > > > > > the goal is that user can flush/re-consume the state as part of t= he > > > > > interface implementation regardless of whether there is change lo= g > > > topic. > > > > > > > > > > Maybe you are suggesting that the main reason to do split+backfil= l > of > > > > input > > > > > topic is to support log compacted topics? You mentioned in Point = 1 > > that > > > > log > > > > > compacted topics is out of the scope of this KIP. Maybe I could > > > > understand > > > > > your position better. Regarding Jan's proposal to split partition= s > > with > > > > > backfill, do you think this should replace the proposal in the > > existing > > > > > KIP, or do you think this is something that we should do in > addition > > to > > > > the > > > > > existing KIP? > > > > > > > > > > > > > I think that interface is a good/necessary component of KIP-253. > > > > > > > > I personally (FWIW) feel that KIP-253 is appropriately scoped, but = I > do > > > > think its utility will be limited unless there is a later KIP > offering > > > > backfill. But, maybe unlike Jan, I think it makes sense to try and > > tackle > > > > the ordering problem independently of backfill, so I'm in support o= f > > the > > > > current KIP. > > > > > > > > - Regarding point 6. I guess we can agree that it is better not to > have > > > the > > > > > performance overhread of copying the input data. Before we discus= s > > more > > > > on > > > > > whether the performance overhead is acceptable or not, I am tryin= g > to > > > > > figure out what is the benefit of introducing this overhread. You > > > > mentioned > > > > > that the benefit is the loose organizational coupling. By > > > "organizational > > > > > coupling", are you referring to the requirement that consumer nee= ds > > to > > > > know > > > > > the hash function of producer? If so, maybe we can discuss the > > use-case > > > > of > > > > > custom partiton function and see whether we can find a way to > support > > > > such > > > > > use-case without having to copy the input data. > > > > > > > > > > > > > I'm not too sure about what an "input" is in this sense, since we a= re > > > just > > > > talking about topics. Actually the point I was making there is that > > > AKAICT > > > > the performance overhead of a backfill is less than any other optio= n, > > > > assuming you split partitions rarely. > > > > > > > > > > By "input" I was referring to source Kafka topic of a stream processi= ng > > > job. > > > > > > > > > > Separately, yes, "organizational coupling" increases if producers a= nd > > > > consumers have to share code, such as the partition function. This > > would > > > > not be the case if producers could only pick from a menu of a few > > > > well-known partition functions, but I think this is a poor tradeoff= . > > > > > > > > > > Maybe we can revisit the custom partition function and see whether we > > > actually need it? Otherwise, I am concerned that every user will pay > the > > > overhead of data movement to support something that was not really > needed > > > for most users. > > > > > > > > > > > > > > To me, this is two strong arguments in favor of backfill being less > > > > expensive than no backfill, but again, I think that particular deba= te > > > comes > > > > after KIP-253, so I don't want to create the impression of oppositi= on > > to > > > > your proposal. > > > > > > > > > > > > Finally, to respond to a new email I just noticed: > > > > > > > > > BTW, here is my understanding of the scope of this KIP. We want t= o > > > allow > > > > > consumers to always consume messages with the same key from the > same > > > > > producer in the order they are produced. And we need to provide a > way > > > for > > > > > stream use-case to be able to flush/load state when messages with > the > > > > same > > > > > key are migrated between consumers. In addition to ensuring that > this > > > > goal > > > > > is correctly supported, we should do our best to keep the > performance > > > and > > > > > organization overhead of this KIP as low as possible. > > > > > > > > > > > > > I think we're on the same page there! In fact, I would generalize a > > > little > > > > more and say that the mechanism you've designed provides *all > > consumers* > > > > the ability "to flush/load state when messages with the same key ar= e > > > > migrated between consumers", not just Streams. > > > > > > > > > > Thanks for all the comment! > > > > > > > > > > > > > > Thanks for the discussion, > > > > -John > > > > > > > > > > > > > > > > On Tue, Mar 27, 2018 at 3:14 PM, Dong Lin > wrote: > > > > > > > > > Hey John, > > > > > > > > > > Thanks much for the detailed comments. Here are my thoughts: > > > > > > > > > > - The need to delete messages from log compacted topics is mainly > for > > > > > performance (e.g. storage space) optimization than for correctnes= s > > for > > > > this > > > > > KIP. I agree that we probably don't need to focus on this in our > > > > discussion > > > > > since it is mostly for performance optimization. > > > > > > > > > > - "Asking producers and consumers, or even two different producer= s, > > to > > > > > share code like the partition function is a pretty huge ask. What > if > > > they > > > > > are using different languages?". It seems that today we already > > require > > > > > different producer's to use the same hash function -- otherwise > > > messages > > > > > with the same key will go to different partitions of the same top= ic > > > which > > > > > may cause problem for downstream consumption. So not sure if it > adds > > > any > > > > > more constraint by assuming consumers know the hash function of > > > producer. > > > > > Could you explain more why user would want to use a cusmtom > partition > > > > > function? Maybe we can check if this is something that can be > > supported > > > > in > > > > > the default Kafka hash function. Also, can you explain more why i= t > is > > > > > difficuilt to implement the same hash function in different > > languages? > > > > > > > > > > - Besides the assumption that consumer needs to share the hash > > function > > > > of > > > > > producer, is there other organization overhead of the proposal in > the > > > > > current KIP? > > > > > > > > > > - Currently producer can forget about the message that has been > > > > > acknowledged by the broker. Thus the producer probably does not > know > > > most > > > > > of the exiting messages in topic, including those messages produc= ed > > by > > > > > other producers. We can have the owner of the producer to > > > split+backfill. > > > > > In my opion it will be a new program that wraps around the existi= ng > > > > > producer and consumer classes. > > > > > > > > > > - Regarding point 5. The argument is in favor of the split+backfi= ll > > but > > > > for > > > > > changelog topic. And it intends to address the problem for stream > > > > use-case > > > > > in general. In this KIP we will provide interface (i.e. > > > > > PartitionKeyRebalanceListener in the KIP) to be used by sream > > use-case > > > > and > > > > > the goal is that user can flush/re-consume the state as part of t= he > > > > > interface implementation regardless of whether there is change lo= g > > > topic. > > > > > > > > > > Maybe you are suggesting that the main reason to do split+backfil= l > of > > > > input > > > > > topic is to support log compacted topics? You mentioned in Point = 1 > > that > > > > log > > > > > compacted topics is out of the scope of this KIP. Maybe I could > > > > understand > > > > > your position better. Regarding Jan's proposal to split partition= s > > with > > > > > backfill, do you think this should replace the proposal in the > > existing > > > > > KIP, or do you think this is something that we should do in > addition > > to > > > > the > > > > > existing KIP? > > > > > > > > > > - Regarding point 6. I guess we can agree that it is better not t= o > > have > > > > the > > > > > performance overhread of copying the input data. Before we discus= s > > more > > > > on > > > > > whether the performance overhead is acceptable or not, I am tryin= g > to > > > > > figure out what is the benefit of introducing this overhread. You > > > > mentioned > > > > > that the benefit is the loose organizational coupling. By > > > "organizational > > > > > coupling", are you referring to the requirement that consumer nee= ds > > to > > > > know > > > > > the hash function of producer? If so, maybe we can discuss the > > use-case > > > > of > > > > > custom partiton function and see whether we can find a way to > support > > > > such > > > > > use-case without having to copy the input data. > > > > > > > > > > Thanks, > > > > > Dong > > > > > > > > > > > > > > > On Tue, Mar 27, 2018 at 11:34 AM, John Roesler > > > > wrote: > > > > > > > > > > > Hey Dong and Jun, > > > > > > > > > > > > Thanks for the thoughtful responses. If you don't mind, I'll mi= x > my > > > > > replies > > > > > > together to try for a coherent response. I'm not too familiar > with > > > > > > mailing-list etiquette, though. > > > > > > > > > > > > I'm going to keep numbering my points because it makes it easy > for > > > you > > > > > all > > > > > > to respond. > > > > > > > > > > > > Point 1: > > > > > > As I read it, KIP-253 is *just* about properly fencing the > > producers > > > > and > > > > > > consumers so that you preserve the correct ordering of records > > during > > > > > > partition expansion. This is clearly necessary regardless of > > anything > > > > > else > > > > > > we discuss. I think this whole discussion about backfill, > > consumers, > > > > > > streams, etc., is beyond the scope of KIP-253. But it would be > > > > cumbersome > > > > > > to start a new thread at this point. > > > > > > > > > > > > I had missed KIP-253's Proposed Change #9 among all the > details... > > I > > > > > think > > > > > > this is a nice addition to the proposal. One thought is that it= 's > > > > > actually > > > > > > irrelevant whether the hash function is linear. This is simply = an > > > > > algorithm > > > > > > for moving a key from one partition to another, so the type of > hash > > > > > > function need not be a precondition. In fact, it also doesn't > > matter > > > > > > whether the topic is compacted or not, the algorithm works > > > regardless. > > > > > > > > > > > > I think this is a good algorithm to keep in mind, as it might > > solve a > > > > > > variety of problems, but it does have a downside: that the > producer > > > > won't > > > > > > know whether or not K1 was actually in P1, it just knows that K= 1 > > was > > > in > > > > > > P1's keyspace before the new epoch. Therefore, it will have to > > > > > > pessimistically send (K1,null) to P1 just in case. But the next > > time > > > K1 > > > > > > comes along, the producer *also* won't remember that it already > > > > retracted > > > > > > K1 from P1, so it will have to send (K1,null) *again*. By > > extension, > > > > > every > > > > > > time the producer sends to P2, it will also have to send a > > tombstone > > > to > > > > > P1, > > > > > > which is a pretty big burden. To make the situation worse, if > there > > > is > > > > a > > > > > > second split, say P2 becomes P2 and P3, then any key Kx belongi= ng > > to > > > P3 > > > > > > will also have to be retracted from P2 *and* P1, since the > producer > > > > can't > > > > > > know whether Kx had been last written to P2 or P1. Over a long > > period > > > > of > > > > > > time, this clearly becomes a issue, as the producer must send a= n > > > > > arbitrary > > > > > > number of retractions along with every update. > > > > > > > > > > > > In contrast, the proposed backfill operation has an end, and > after > > it > > > > > ends, > > > > > > everyone can afford to forget that there ever was a different > > > partition > > > > > > layout. > > > > > > > > > > > > Really, though, figuring out how to split compacted topics is > > beyond > > > > the > > > > > > scope of KIP-253, so I'm not sure #9 really even needs to be in > > this > > > > > KIP... > > > > > > We do need in-order delivery during partition expansion. It wou= ld > > be > > > > fine > > > > > > by me to say that you *cannot* expand partitions of a > log-compacted > > > > topic > > > > > > and call it a day. I think it would be better to tackle that in > > > another > > > > > > KIP. > > > > > > > > > > > > > > > > > > Point 2: > > > > > > Regarding whether the consumer re-shuffles its inputs, this is > > always > > > > on > > > > > > the table; any consumer who wants to re-shuffle its input is fr= ee > > to > > > do > > > > > so. > > > > > > But this is currently not required. It's just that the current > > > > high-level > > > > > > story with Kafka encourages the use of partitions as a unit of > > > > > concurrency. > > > > > > As long as consumers are single-threaded, they can happily > consume > > a > > > > > single > > > > > > partition without concurrency control of any kind. This is a ke= y > > > aspect > > > > > to > > > > > > this system that lets folks design high-throughput systems on t= op > > of > > > it > > > > > > surprisingly easily. If all consumers were instead > > > encouraged/required > > > > to > > > > > > implement a repartition of their own, then the consumer becomes > > > > > > significantly more complex, requiring either the consumer to > first > > > > > produce > > > > > > to its own intermediate repartition topic or to ensure that > > consumer > > > > > > threads have a reliable, high-bandwith channel of communication > > with > > > > > every > > > > > > other consumer thread. > > > > > > > > > > > > Either of those tradeoffs may be reasonable for a particular us= er > > of > > > > > Kafka, > > > > > > but I don't know if we're in a position to say that they are > > > reasonable > > > > > for > > > > > > *every* user of Kafka. > > > > > > > > > > > > > > > > > > Point 3: > > > > > > Regarding Jun's point about this use case, "(3) stateful and > > > > maintaining > > > > > > the > > > > > > states in a local store", I agree that they may use a framework > > > *like* > > > > > > Kafka Streams, but that is not the same as using Kafka Streams. > > This > > > is > > > > > why > > > > > > I think it's better to solve it in Core: because it is then > solved > > > for > > > > > > KStreams and also for everything else that facilitates local > state > > > > > > maintenance. To me, Streams is a member of the category of > "stream > > > > > > processing frameworks", which is itself a subcategory of "thing= s > > > > > requiring > > > > > > local state maintenence". I'm not sure if it makes sense to > assert > > > that > > > > > > Streams is a sufficient and practical replacement for everythin= g > in > > > > > "things > > > > > > requiring local state maintenence". > > > > > > > > > > > > But, yes, I do agree that per-key ordering is an absolute > > > requirement, > > > > > > therefore I think that KIP-253 itself is a necessary step. > > Regarding > > > > the > > > > > > coupling of the state store partitioning to the topic > partitioning, > > > > yes, > > > > > > this is an issue we are discussing solutions to right now. We m= ay > > go > > > > > ahead > > > > > > and introduce an overpartition layer on our inputs to solve it, > but > > > > then > > > > > > again, if we get the ability to split partitions with backfill, > we > > > may > > > > > not > > > > > > need to! > > > > > > > > > > > > > > > > > > Point 4: > > > > > > On this: > > > > > > > > > > > > > Regarding thought 2: If we don't care about the stream > use-case, > > > then > > > > > the > > > > > > > current KIP probably has already addressed problem without > > > requiring > > > > > > > consumer to know the partition function. If we care about the > > > stream > > > > > > > use-case, we already need coordination across producers of > > > different > > > > > > > topics, i.e. the same partition function needs to be used by > > > > producers > > > > > of > > > > > > > topics A and B in order to join topics A and B. Thus, it migh= t > be > > > > > > > reasonable to extend coordination a bit and say we need > > > coordination > > > > > > across > > > > > > > clients (i.e. producer and consumer), such that consumer know= s > > the > > > > > > > partition function used by producer. If we do so, then we can > let > > > > > > consumer > > > > > > > re-copy data for the change log topic using the same partitio= n > > > > function > > > > > > as > > > > > > > producer. This approach has lower overhead as compared to > having > > > > > producer > > > > > > > re-copy data of the input topic. > > > > > > > Also, producer currently does not need to know the data alrea= dy > > > > > produced > > > > > > to > > > > > > > the topic. If we let producer split/merge partition, it would > > > require > > > > > > > producer to consume the existing data, which intuitively is t= he > > > task > > > > of > > > > > > > consumer. > > > > > > > > > > > > > > > > > > I think we do care about use cases *like* Streams, I just don't > > think > > > > we > > > > > > should rely on Streams to implement a feature of Core like > > partition > > > > > > expansion. > > > > > > > > > > > > Note, though, that we (Streams) do not require coordination > across > > > > > > producers. If two topics are certified to be co-partitioned, th= en > > > > Streams > > > > > > apps can make use of that knowledge to optimize their topology > > > > (skipping > > > > > a > > > > > > repartition). But if they don't know whether they are > > co-partitioned, > > > > > then > > > > > > they'd better go ahead and repartition within the topology. Thi= s > is > > > the > > > > > > current state. > > > > > > > > > > > > A huge selling point of Kafka is enabling different parts of > > loosely > > > > > > coupled organizations to produce and consume data independently= . > > Some > > > > > > coordination between producers and consumers is necessary, like > > > > > > coordinating on the names of topics and their schemas. But > Kafka's > > > > value > > > > > > proposition w.r.t. ESBs, etc. is inversely proportional to the > > amount > > > > of > > > > > > coordination required. I think it behooves us to be extremely > > > skeptical > > > > > > about introducing any coordination beyond correctness protocols= . > > > > > > > > > > > > Asking producers and consumers, or even two different producers= , > to > > > > share > > > > > > code like the partition function is a pretty huge ask. What if > they > > > are > > > > > > using different languages? > > > > > > > > > > > > Comparing organizational overhead vs computational overhead, > there > > > are > > > > > > maybe two orders of magnitude difference between them. In other > > > words, > > > > I > > > > > > would happily take on the (linear) overhead of having the > producer > > > > > re-copy > > > > > > the data once during a re-partition in order to save the > > > organizational > > > > > > overhead of tying all the producers and consumers together acro= ss > > > > > multiple > > > > > > boundaries. > > > > > > > > > > > > On that last paragraph: note that the producer *did* know the > data > > it > > > > > > already produced. It handled it the first time around. Asking i= t > to > > > > > > re-produce it into a new partition layout is squarely within it= s > > > scope > > > > of > > > > > > capabilities. Contrast this with the alternative, asking the > > consumer > > > > to > > > > > > re-partition the data. I think this is even less intuitive, whe= n > > the > > > > > > partition function belongs to the producer. > > > > > > > > > > > > > > > > > > Point 5: > > > > > > Dong asked this: > > > > > > > > > > > > > For stream use-case that needs to increase consumer number, t= he > > > > > > > existing consumer can backfill the existing data in the chang= e > > log > > > > > topic > > > > > > to > > > > > > > the same change log topic with the new partition number, befo= re > > the > > > > new > > > > > > set > > > > > > > of consumers bootstrap state from the new partitions of the > > change > > > > log > > > > > > > topic, right? > > > > > > > > > > > > > > > > > > In this sense, the "consumer" is actually the producer of the > > > changelog > > > > > > topic, so if we support partition expansion + backfill as a > > > > > producer/broker > > > > > > operation, then it would be very straightforward for Streams to > > > split a > > > > > > state store. As you say, they would simply instruct the broker = to > > > split > > > > > the > > > > > > changelog topic's partitions, then backfill. Once the backfill = is > > > > ready, > > > > > > they can create a new crop of StandbyTasks to bootstrap the mor= e > > > > granular > > > > > > state stores and finally switch over to them when they are read= y. > > > > > > > > > > > > But this actually seems to be an argument in favor of > > split+backfill, > > > > so > > > > > > maybe I missed the point. > > > > > > > > > > > > You also asked me to explain why copying the "input" topic is > > better > > > > than > > > > > > copying the "changelog" topic. I think they are totally > > independent, > > > > > > actually. For one thing, you can't depend on the existence of a > > > > > "changelog" > > > > > > topic in general, only within Streams, but Kafka's user base > > clearly > > > > > > exceeds Streams's user base. Plus, you actually also can't depe= nd > > on > > > > the > > > > > > existence of a changelog topic within Streams, since that is an > > > > optional > > > > > > feature of *some* state store implementations. Even in the > > situation > > > > > where > > > > > > you do have a changelog topic in Streams, there may be use case= s > > > where > > > > it > > > > > > makes sense to expand the partitions of just the input, or just > the > > > > > > changelog. > > > > > > > > > > > > The ask for a Core feature of split+backfill is really about > > > supporting > > > > > the > > > > > > use case of splitting partitions in log-compacted topics, > > regardless > > > of > > > > > > whether that topic is an "input" or a "changelog" or anything > else > > > for > > > > > that > > > > > > matter. > > > > > > > > > > > > > > > > > > Point 6: > > > > > > On the concern about the performance overhead of copying data > > between > > > > the > > > > > > brokers, I think it's actually a bit overestimated. Splitting a > > > topic's > > > > > > partition is probably rare, certainly rarer in general than > > > > bootstrapping > > > > > > new consumers on that topic. If "bootstrapping new consumers" > means > > > > that > > > > > > they have to re-shuffle the data before they consume it, then y= ou > > > wind > > > > up > > > > > > copying the same record multiple times: > > > > > > > > > > > > (broker: input topic) -> (initial consumer) -> (broker: > repartition > > > > > topic) > > > > > > -> (real consumer) > > > > > > > > > > > > That's 3x, and it's also 3x for every new record after the spli= t > as > > > > well, > > > > > > since you don't get to stop repartitioning/reshuffling once you > > > start. > > > > > > > > > > > > Whereas if you do a backfill in something like the procedure I > > > > outlined, > > > > > > you only copy the prefix of the partition before the split, and > you > > > > send > > > > > it > > > > > > once to the producer and then once to the new generation > partition. > > > > Plus, > > > > > > assuming we're splitting the partition for the benefit of > > consumers, > > > > > > there's no reason we can't co-locate the post-split partitions = on > > the > > > > > same > > > > > > host as the pre-split partition, making the second copy a local > > > > > filesystem > > > > > > operation. > > > > > > > > > > > > Even if you follow these two copies up with bootstrapping a new > > > > consumer, > > > > > > it's still rare for this to occur, so you get to amortize these > > > copies > > > > > over > > > > > > the lifetime of the topic, whereas a reshuffle just keeps makin= g > > > copies > > > > > for > > > > > > every new event. > > > > > > > > > > > > And finally, I really do think that regardless of any performan= ce > > > > > concerns > > > > > > about this operation, if it preserves loose organizational > > coupling, > > > it > > > > > is > > > > > > certainly worth it. > > > > > > > > > > > > > > > > > > In conclusion: > > > > > > It might actually be a good idea for us to clarify the scope of > > > > KIP-253. > > > > > If > > > > > > we're all agreed that it's a good algorithm for allowing in-ord= er > > > > message > > > > > > delivery during partition expansion, then we can continue this > > > > discussion > > > > > > as a new KIP, something like "backfill with partition expansion= ". > > > This > > > > > > would let Dong proceed with KIP-253. On the other hand, if it > seems > > > > like > > > > > > this conversation may alter the design of KIP-253, then maybe w= e > > > > *should* > > > > > > just finish working it out. > > > > > > > > > > > > For my part, my only concern about KIP-253 is the one I raised > > > earlier. > > > > > > > > > > > > Thanks again, all, for considering these points, > > > > > > -John > > > > > > > > > > > > > > > > > > On Tue, Mar 27, 2018 at 2:10 AM, Dong Lin > > > wrote: > > > > > > > > > > > > > On Tue, Mar 27, 2018 at 12:04 AM, Dong Lin < > lindong28@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > > Hey Jan, > > > > > > > > > > > > > > > > Thanks for the enthusiasm in improving Kafka's design. Now > > that I > > > > > have > > > > > > > > read through your discussion with Jun, here are my thoughts= : > > > > > > > > > > > > > > > > - The latest proposal should with log compacted topics by > > > properly > > > > > > > > deleting old messages after a new message with the same key > is > > > > > > produced. > > > > > > > So > > > > > > > > it is probably not a concern anymore. Could you comment if > > there > > > is > > > > > > still > > > > > > > > issue? > > > > > > > > > > > > > > > > - I wrote the SEP-5 and I am pretty familiar with the > > motivation > > > > and > > > > > > the > > > > > > > > design of SEP-5. SEP-5 is probably orthornal to the > motivation > > of > > > > > this > > > > > > > KIP. > > > > > > > > The goal of SEP-5 is to allow user to increase task number = of > > an > > > > > > existing > > > > > > > > Samza job. But if we increase the partition number of input > > > topics, > > > > > > > > messages may still be consumed out-of-order by tasks in Sam= za > > > which > > > > > > cause > > > > > > > > incorrect result. Similarly, the approach you proposed does > not > > > > seem > > > > > to > > > > > > > > ensure that the messages can be delivered in order, even if > we > > > can > > > > > make > > > > > > > > sure that each consumer instance is assigned the set of new > > > > > partitions > > > > > > > > covering the same set of keys. > > > > > > > > > > > > > > > > > > > > > > Let me correct this comment. The approach of copying data to = a > > new > > > > > topic > > > > > > > can ensure in-order message delivery suppose we properly > migrate > > > > > offsets > > > > > > > from old topic to new topic. > > > > > > > > > > > > > > > > > > > > > > - I am trying to understand why it is better to copy the da= ta > > > > instead > > > > > > of > > > > > > > > copying the change log topic for streaming use-case. For co= re > > > Kafka > > > > > > > > use-case, and for the stream use-case that does not need to > > > > increase > > > > > > > > consumers, the current KIP already supports in-order delive= ry > > > > without > > > > > > the > > > > > > > > overhead of copying the data. For stream use-case that need= s > to > > > > > > increase > > > > > > > > consumer number, the existing consumer can backfill the > > existing > > > > data > > > > > > in > > > > > > > > the change log topic to the same change log topic with the > new > > > > > > partition > > > > > > > > number, before the new set of consumers bootstrap state fro= m > > the > > > > new > > > > > > > > partitions of the change log topic. If this solution works, > > then > > > > > could > > > > > > > you > > > > > > > > summarize the advantage of copying the data of input topic = as > > > > > compared > > > > > > to > > > > > > > > copying the change log topic? For example, does it enable > more > > > > > > use-case, > > > > > > > > simplify the implementation of Kafka library, or reduce the > > > > operation > > > > > > > > overhead etc? > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Dong > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Mar 21, 2018 at 6:57 AM, Jan Filipiak < > > > > > > Jan.Filipiak@trivago.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > >> Hi Jun, > > > > > > > >> > > > > > > > >> I was really seeing progress in our conversation but your > > latest > > > > > reply > > > > > > > is > > > > > > > >> just devastating. > > > > > > > >> I though we were getting close being on the same page now = it > > > feels > > > > > > like > > > > > > > >> we are in different libraries. > > > > > > > >> > > > > > > > >> I just quickly slam my answers in here. If they are to > brief I > > > am > > > > > > sorry > > > > > > > >> give me a ping and try to go into details more. > > > > > > > >> Just want to show that your pro/cons listing is broken. > > > > > > > >> > > > > > > > >> Best Jan > > > > > > > >> > > > > > > > >> and want to get rid of this horrible compromise > > > > > > > >> > > > > > > > >> > > > > > > > >> On 19.03.2018 05:48, Jun Rao wrote: > > > > > > > >> > > > > > > > >>> Hi, Jan, > > > > > > > >>> > > > > > > > >>> Thanks for the discussion. Great points. > > > > > > > >>> > > > > > > > >>> Let me try to summarize the approach that you are > proposing. > > On > > > > the > > > > > > > >>> broker > > > > > > > >>> side, we reshuffle the existing data in a topic from > current > > > > > > partitions > > > > > > > >>> to > > > > > > > >>> the new partitions. Once the reshuffle fully catches up, > > switch > > > > the > > > > > > > >>> consumers to start consuming from the new partitions. If = a > > > > consumer > > > > > > > needs > > > > > > > >>> to rebuild its local state (due to partition changes), le= t > > the > > > > > > consumer > > > > > > > >>> rebuild its state by reading all existing data from the n= ew > > > > > > partitions. > > > > > > > >>> Once all consumers have switches over, cut over the > producer > > to > > > > the > > > > > > new > > > > > > > >>> partitions. > > > > > > > >>> > > > > > > > >>> The pros for this approach are that : > > > > > > > >>> 1. There is just one way to rebuild the local state, whic= h > is > > > > > > simpler. > > > > > > > >>> > > > > > > > >> true thanks > > > > > > > >> > > > > > > > >>> > > > > > > > >>> The cons for this approach are: > > > > > > > >>> 1. Need to copy existing data. > > > > > > > >>> > > > > > > > >> Very unfair and not correct. It does not require you to co= py > > > over > > > > > > > >> existing data. It _allows_ you to copy all existing data. > > > > > > > >> > > > > > > > >> 2. The cutover of the producer is a bit complicated since = it > > > needs > > > > > to > > > > > > > >>> coordinate with all consumer groups. > > > > > > > >>> > > > > > > > >> Also not true. I explicitly tried to make clear that there > is > > > only > > > > > one > > > > > > > >> special consumer (in the case of actually copying data) > > > > coordination > > > > > > is > > > > > > > >> required. > > > > > > > >> > > > > > > > >>> 3. The rebuilding of the state in the consumer is from th= e > > > input > > > > > > topic, > > > > > > > >>> which can be more expensive than rebuilding from the > existing > > > > > state. > > > > > > > >>> > > > > > > > >> true, but rebuilding state is only required if you want to > > > > increase > > > > > > > >> processing power, so we assume this is at hand. > > > > > > > >> > > > > > > > >>> 4. The broker potentially has to know the partitioning > > > function. > > > > If > > > > > > > this > > > > > > > >>> needs to be customized at the topic level, it can be a bi= t > > > messy. > > > > > > > >>> > > > > > > > >> I would argue against having the operation being performed > by > > > the > > > > > > > broker. > > > > > > > >> This was not discussed yet but if you see my original emai= l > i > > > > > > suggested > > > > > > > >> otherwise from the beginning. > > > > > > > >> > > > > > > > >>> > > > > > > > >>> Here is an alternative approach by applying your idea not > in > > > the > > > > > > > broker, > > > > > > > >>> but in the consumer. When new partitions are added, we > don't > > > move > > > > > > > >>> existing > > > > > > > >>> data. In KStreams, we first reshuffle the new input data > to a > > > new > > > > > > topic > > > > > > > >>> T1 > > > > > > > >>> with the old number of partitions and feed T1's data to t= he > > > rest > > > > of > > > > > > the > > > > > > > >>> pipeline. In the meantime, KStreams reshuffles all existi= ng > > > data > > > > of > > > > > > the > > > > > > > >>> change capture topic to another topic C1 with the new > number > > of > > > > > > > >>> partitions. > > > > > > > >>> We can then build the state of the new tasks from C1. Onc= e > > the > > > > new > > > > > > > states > > > > > > > >>> have been fully built, we can cut over the consumption to > the > > > > input > > > > > > > topic > > > > > > > >>> and delete T1. This approach works with compacted topic > too. > > If > > > > an > > > > > > > >>> application reads from the beginning of a compacted topic= , > > the > > > > > > consumer > > > > > > > >>> will reshuffle the portion of the input when the number o= f > > > > > partitions > > > > > > > >>> doesn't match the number of tasks. > > > > > > > >>> > > > > > > > >> We all wipe this idea from our heads instantly. Mixing Ide= as > > > from > > > > an > > > > > > > >> argument is not a resolution strategy > > > > > > > >> just leads to horrible horrible software. > > > > > > > >> > > > > > > > >> > > > > > > > >>> The pros of this approach are: > > > > > > > >>> 1. No need to copy existing data. > > > > > > > >>> 2. Each consumer group can cut over to the new partitions > > > > > > > independently. > > > > > > > >>> 3. The state is rebuilt from the change capture topic, > which > > is > > > > > > cheaper > > > > > > > >>> than rebuilding from the input topic. > > > > > > > >>> 4. Only the KStreams job needs to know the partitioning > > > function. > > > > > > > >>> > > > > > > > >>> The cons of this approach are: > > > > > > > >>> 1. Potentially the same input topic needs to be reshuffle= d > > more > > > > > than > > > > > > > once > > > > > > > >>> in different consumer groups during the transition phase. > > > > > > > >>> > > > > > > > >>> What do you think? > > > > > > > >>> > > > > > > > >>> Thanks, > > > > > > > >>> > > > > > > > >>> Jun > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> On Thu, Mar 15, 2018 at 1:04 AM, Jan Filipiak < > > > > > > > Jan.Filipiak@trivago.com> > > > > > > > >>> wrote: > > > > > > > >>> > > > > > > > >>> Hi Jun, > > > > > > > >>>> > > > > > > > >>>> thank you for following me on these thoughts. It was > > important > > > > to > > > > > me > > > > > > > to > > > > > > > >>>> feel that kind of understanding for my arguments. > > > > > > > >>>> > > > > > > > >>>> What I was hoping for (I mentioned this earlier) is that > we > > > can > > > > > > model > > > > > > > >>>> the > > > > > > > >>>> case where we do not want to copy the data the exact sam= e > > way > > > as > > > > > the > > > > > > > >>>> case > > > > > > > >>>> when we do copy the data. Maybe you can peek into the > mails > > > > before > > > > > > to > > > > > > > >>>> see > > > > > > > >>>> more details for this. > > > > > > > >>>> > > > > > > > >>>> This means we have the same mechanism to transfer consum= er > > > > groups > > > > > to > > > > > > > >>>> switch topic. The offset mapping that would be generated > > would > > > > > even > > > > > > be > > > > > > > >>>> simpler End Offset of the Old topic =3D> offset 0 off al= l > the > > > > > > partitions > > > > > > > >>>> of > > > > > > > >>>> the new topic. Then we could model the transition of a > > > non-copy > > > > > > > >>>> expansion > > > > > > > >>>> the exact same way as a copy-expansion. > > > > > > > >>>> > > > > > > > >>>> I know this only works when topic growth by a factor. Bu= t > > the > > > > > > benefits > > > > > > > >>>> of > > > > > > > >>>> only growing by a factor are to strong anyways. See > > Clemens's > > > > hint > > > > > > and > > > > > > > >>>> remember that state reshuffling is entirely not needed i= f > > one > > > > > > doesn't > > > > > > > >>>> want > > > > > > > >>>> to grow processing power. > > > > > > > >>>> > > > > > > > >>>> I think these benefits should be clear, and that there i= s > > > > > basically > > > > > > no > > > > > > > >>>> downside to what is currently at hand but just makes > > > everything > > > > > > easy. > > > > > > > >>>> > > > > > > > >>>> One thing you need to know is. that if you do not offer > > > > > rebuilding a > > > > > > > log > > > > > > > >>>> compacted topic like i suggest that even if you have > > consumer > > > > > state > > > > > > > >>>> reshuffling. The topic is broken and can not be used to > > > > bootstrap > > > > > > new > > > > > > > >>>> consumers. They don't know if they need to apply a key > from > > > and > > > > > old > > > > > > > >>>> partition or not. This is a horrible downside I haven't > > seen a > > > > > > > solution > > > > > > > >>>> for > > > > > > > >>>> in the email conversation. > > > > > > > >>>> > > > > > > > >>>> I argue to: > > > > > > > >>>> > > > > > > > >>>> Only grow topic by a factor always. > > > > > > > >>>> Have the "no copy consumer" transition as the trivial ca= se > > of > > > > the > > > > > > > "copy > > > > > > > >>>> consumer transition". > > > > > > > >>>> If processors needs to be scaled, let them rebuild from > the > > > new > > > > > > topic > > > > > > > >>>> and > > > > > > > >>>> leave the old running in the mean time. > > > > > > > >>>> Do not implement key shuffling in streams. > > > > > > > >>>> > > > > > > > >>>> I hope I can convince you especially with the fact how I > > want > > > to > > > > > > > handle > > > > > > > >>>> consumer transition. I think > > > > > > > >>>> you didn't quite understood me there before. I think the > > term > > > > "new > > > > > > > >>>> topic" > > > > > > > >>>> intimidated you a little. > > > > > > > >>>> How we solve this on disc doesn't really matter, If the > data > > > > goes > > > > > > into > > > > > > > >>>> the > > > > > > > >>>> same Dir or a different Dir or anything. I do think that > it > > > > needs > > > > > to > > > > > > > >>>> involve at least rolling a new segment for the existing > > > > > partitions. > > > > > > > >>>> But most of the transitions should work without restarti= ng > > > > > > consumers. > > > > > > > >>>> (newer consumers with support for this). But with new > topic > > i > > > > just > > > > > > > meant > > > > > > > >>>> the topic that now has a different partition count. Plen= ty > > of > > > > ways > > > > > > to > > > > > > > >>>> handle that (versions, aliases) > > > > > > > >>>> > > > > > > > >>>> Hope I can further get my idea across. > > > > > > > >>>> > > > > > > > >>>> Best Jan > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> On 14.03.2018 02:45, Jun Rao wrote: > > > > > > > >>>> > > > > > > > >>>> Hi, Jan, > > > > > > > >>>>> > > > > > > > >>>>> Thanks for sharing your view. > > > > > > > >>>>> > > > > > > > >>>>> I agree with you that recopying the data potentially > makes > > > the > > > > > > state > > > > > > > >>>>> management easier since the consumer can just rebuild i= ts > > > state > > > > > > from > > > > > > > >>>>> scratch (i.e., no need for state reshuffling). > > > > > > > >>>>> > > > > > > > >>>>> On the flip slide, I saw a few disadvantages of the > > approach > > > > that > > > > > > you > > > > > > > >>>>> suggested. (1) Building the state from the input topic > from > > > > > scratch > > > > > > > is > > > > > > > >>>>> in > > > > > > > >>>>> general less efficient than state reshuffling. Let's sa= y > > one > > > > > > > computes a > > > > > > > >>>>> count per key from an input topic. The former requires > > > reading > > > > > all > > > > > > > >>>>> existing > > > > > > > >>>>> records in the input topic whereas the latter only > requires > > > > > reading > > > > > > > >>>>> data > > > > > > > >>>>> proportional to the number of unique keys. (2) The > > switching > > > of > > > > > the > > > > > > > >>>>> topic > > > > > > > >>>>> needs modification to the application. If there are man= y > > > > > > applications > > > > > > > >>>>> on a > > > > > > > >>>>> topic, coordinating such an effort may not be easy. Als= o, > > > it's > > > > > not > > > > > > > >>>>> clear > > > > > > > >>>>> how to enforce exactly-once semantic during the switch. > (3) > > > If > > > > a > > > > > > > topic > > > > > > > >>>>> doesn't need any state management, recopying the data > seems > > > > > > wasteful. > > > > > > > >>>>> In > > > > > > > >>>>> that case, in place partition expansion seems more > > desirable. > > > > > > > >>>>> > > > > > > > >>>>> I understand your concern about adding complexity in > > > KStreams. > > > > > But, > > > > > > > >>>>> perhaps > > > > > > > >>>>> we could iterate on that a bit more to see if it can be > > > > > simplified. > > > > > > > >>>>> > > > > > > > >>>>> Jun > > > > > > > >>>>> > > > > > > > >>>>> > > > > > > > >>>>> On Mon, Mar 12, 2018 at 11:21 PM, Jan Filipiak < > > > > > > > >>>>> Jan.Filipiak@trivago.com> > > > > > > > >>>>> wrote: > > > > > > > >>>>> > > > > > > > >>>>> Hi Jun, > > > > > > > >>>>> > > > > > > > >>>>>> I will focus on point 61 as I think its _the_ > fundamental > > > part > > > > > > that > > > > > > > I > > > > > > > >>>>>> cant > > > > > > > >>>>>> get across at the moment. > > > > > > > >>>>>> > > > > > > > >>>>>> Kafka is the platform to have state materialized > multiple > > > > times > > > > > > from > > > > > > > >>>>>> one > > > > > > > >>>>>> input. I emphasize this: It is the building block in > > > > > architectures > > > > > > > >>>>>> that > > > > > > > >>>>>> allow you to > > > > > > > >>>>>> have your state maintained multiple times. You put a > > message > > > > in > > > > > > > once, > > > > > > > >>>>>> and > > > > > > > >>>>>> you have it pop out as often as you like. I believe yo= u > > > > > understand > > > > > > > >>>>>> this. > > > > > > > >>>>>> > > > > > > > >>>>>> Now! The path of thinking goes the following: I am usi= ng > > > > apache > > > > > > > kafka > > > > > > > >>>>>> and > > > > > > > >>>>>> I _want_ my state multiple times. What am I going todo= ? > > > > > > > >>>>>> > > > > > > > >>>>>> A) Am I going to take my state that I build up, plunge > > some > > > > sort > > > > > > of > > > > > > > >>>>>> RPC > > > > > > > >>>>>> layer ontop of it, use that RPC layer to throw my > records > > > > across > > > > > > > >>>>>> instances? > > > > > > > >>>>>> B) Am I just going to read the damn message twice? > > > > > > > >>>>>> > > > > > > > >>>>>> Approach A is fundamentally flawed and a violation of > all > > > that > > > > > is > > > > > > > good > > > > > > > >>>>>> and > > > > > > > >>>>>> holy in kafka deployments. I can not understand how th= is > > > Idea > > > > > can > > > > > > > >>>>>> come in > > > > > > > >>>>>> the first place. > > > > > > > >>>>>> (I do understand: IQ in streams, they polluted the kaf= ka > > > > streams > > > > > > > >>>>>> codebase > > > > > > > >>>>>> really bad already. It is not funny! I think they are > > > equally > > > > > > flawed > > > > > > > >>>>>> as > > > > > > > >>>>>> A) > > > > > > > >>>>>> > > > > > > > >>>>>> I say, we do what Kafka is good at. We repartition the > > topic > > > > > once. > > > > > > > We > > > > > > > >>>>>> switch the consumers. > > > > > > > >>>>>> (Those that need more partitions are going to rebuild > > their > > > > > state > > > > > > in > > > > > > > >>>>>> multiple partitions by reading the new topic, those th= at > > > don't > > > > > > just > > > > > > > >>>>>> assign > > > > > > > >>>>>> the new partitions properly) > > > > > > > >>>>>> We switch producers. Done! > > > > > > > >>>>>> > > > > > > > >>>>>> The best thing! It is trivial, hipster stream processo= r > > will > > > > > have > > > > > > an > > > > > > > >>>>>> easy > > > > > > > >>>>>> time with that aswell. Its so super simple. And simple > IS > > > > good! > > > > > > > >>>>>> It is what kafka was build todo. It is how we do it > today. > > > > All I > > > > > > am > > > > > > > >>>>>> saying > > > > > > > >>>>>> is that a little broker help doing the producer swap i= s > > > super > > > > > > > useful. > > > > > > > >>>>>> > > > > > > > >>>>>> For everyone interested in why kafka is so powerful wi= th > > > > > approach > > > > > > B, > > > > > > > >>>>>> please watch https://youtu.be/bEbeZPVo98c?t=3D1633 > > > > > > > >>>>>> I already looked up a good point in time, I think afte= r > 5 > > > > > minutes > > > > > > > the > > > > > > > >>>>>> "state" topic is handled and you should be able to > > > understand > > > > me > > > > > > > >>>>>> and inch better. > > > > > > > >>>>>> > > > > > > > >>>>>> Please do not do A to the project, it deserves better! > > > > > > > >>>>>> > > > > > > > >>>>>> Best Jan > > > > > > > >>>>>> > > > > > > > >>>>>> > > > > > > > >>>>>> > > > > > > > >>>>>> On 13.03.2018 02:40, Jun Rao wrote: > > > > > > > >>>>>> > > > > > > > >>>>>> Hi, Jan, > > > > > > > >>>>>> > > > > > > > >>>>>>> Thanks for the reply. A few more comments below. > > > > > > > >>>>>>> > > > > > > > >>>>>>> 50. Ok, we can think a bit harder for supporting > > compacted > > > > > > topics. > > > > > > > >>>>>>> > > > > > > > >>>>>>> 51. This is a fundamental design question. In the mor= e > > > common > > > > > > case, > > > > > > > >>>>>>> the > > > > > > > >>>>>>> reason why someone wants to increase the number of > > > partitions > > > > > is > > > > > > > that > > > > > > > >>>>>>> the > > > > > > > >>>>>>> consumer application is slow and one wants to run mor= e > > > > consumer > > > > > > > >>>>>>> instances > > > > > > > >>>>>>> to increase the degree of parallelism. So, fixing the > > > number > > > > of > > > > > > > >>>>>>> running > > > > > > > >>>>>>> consumer instances when expanding the partitions won'= t > > help > > > > > this > > > > > > > >>>>>>> case. > > > > > > > >>>>>>> If > > > > > > > >>>>>>> we do need to increase the number of consumer > instances, > > we > > > > > need > > > > > > to > > > > > > > >>>>>>> somehow > > > > > > > >>>>>>> reshuffle the state of the consumer across instances. > > What > > > we > > > > > > have > > > > > > > >>>>>>> been > > > > > > > >>>>>>> discussing in this KIP is whether we can do this more > > > > > effectively > > > > > > > >>>>>>> through > > > > > > > >>>>>>> the KStream library (e.g. through a 2-phase partition > > > > > expansion). > > > > > > > >>>>>>> This > > > > > > > >>>>>>> will > > > > > > > >>>>>>> add some complexity, but it's probably better than > > everyone > > > > > doing > > > > > > > >>>>>>> this > > > > > > > >>>>>>> in > > > > > > > >>>>>>> the application space. The recopying approach that yo= u > > > > > mentioned > > > > > > > >>>>>>> doesn't > > > > > > > >>>>>>> seem to address the consumer state management issue > when > > > the > > > > > > > consumer > > > > > > > >>>>>>> switches from an old to a new topic. > > > > > > > >>>>>>> > > > > > > > >>>>>>> 52. As for your example, it depends on whether the jo= in > > key > > > > is > > > > > > the > > > > > > > >>>>>>> same > > > > > > > >>>>>>> between (A,B) and (B,C). If the join key is the same, > we > > > can > > > > > do a > > > > > > > >>>>>>> 2-phase > > > > > > > >>>>>>> partition expansion of A, B, and C together. If the > join > > > keys > > > > > are > > > > > > > >>>>>>> different, one would need to repartition the data on = a > > > > > different > > > > > > > key > > > > > > > >>>>>>> for > > > > > > > >>>>>>> the second join, then the partition expansion can be > done > > > > > > > >>>>>>> independently > > > > > > > >>>>>>> between (A,B) and (B,C). > > > > > > > >>>>>>> > > > > > > > >>>>>>> 53. If you always fix the number of consumer instance= s, > > we > > > > you > > > > > > > >>>>>>> described > > > > > > > >>>>>>> works. However, as I mentioned in #51, I am not sure > how > > > your > > > > > > > >>>>>>> proposal > > > > > > > >>>>>>> deals with consumer states when the number of consume= r > > > > > instances > > > > > > > >>>>>>> grows. > > > > > > > >>>>>>> Also, it just seems that it's better to avoid > re-copying > > > the > > > > > > > existing > > > > > > > >>>>>>> data. > > > > > > > >>>>>>> > > > > > > > >>>>>>> 60. "just want to throw in my question from the longe= r > > > email > > > > in > > > > > > the > > > > > > > >>>>>>> other > > > > > > > >>>>>>> Thread here. How will the bloom filter help a new > > consumer > > > to > > > > > > > decide > > > > > > > >>>>>>> to > > > > > > > >>>>>>> apply the key or not?" Not sure that I fully understo= od > > > your > > > > > > > >>>>>>> question. > > > > > > > >>>>>>> The > > > > > > > >>>>>>> consumer just reads whatever key is in the log. The > bloom > > > > > filter > > > > > > > just > > > > > > > >>>>>>> helps > > > > > > > >>>>>>> clean up the old keys. > > > > > > > >>>>>>> > > > > > > > >>>>>>> 61. "Why can we afford having a topic where its > > apparently > > > > not > > > > > > > >>>>>>> possible > > > > > > > >>>>>>> to > > > > > > > >>>>>>> start a new application on? I think this is an overal= l > > flaw > > > > of > > > > > > the > > > > > > > >>>>>>> discussed idea here. Not playing attention to the > overall > > > > > > > >>>>>>> architecture." > > > > > > > >>>>>>> Could you explain a bit more when one can't start a n= ew > > > > > > > application? > > > > > > > >>>>>>> > > > > > > > >>>>>>> Jun > > > > > > > >>>>>>> > > > > > > > >>>>>>> > > > > > > > >>>>>>> > > > > > > > >>>>>>> On Sat, Mar 10, 2018 at 1:40 AM, Jan Filipiak < > > > > > > > >>>>>>> Jan.Filipiak@trivago.com > > > > > > > >>>>>>> wrote: > > > > > > > >>>>>>> > > > > > > > >>>>>>> Hi Jun, thanks for your mail. > > > > > > > >>>>>>> > > > > > > > >>>>>>> Thank you for your questions! > > > > > > > >>>>>>>> I think they are really good and tackle the core of > the > > > > > problem > > > > > > I > > > > > > > >>>>>>>> see. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> I will answer inline, mostly but still want to set t= he > > > tone > > > > > > here. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> The core strength of kafka is what Martin once calle= d > > the > > > > > > > >>>>>>>> kappa-Architecture. How does this work? > > > > > > > >>>>>>>> You have everything as a log as in kafka. When you > need > > to > > > > > > change > > > > > > > >>>>>>>> something. > > > > > > > >>>>>>>> You create the new version of your application and > leave > > > it > > > > > > > running > > > > > > > >>>>>>>> in > > > > > > > >>>>>>>> parallel. > > > > > > > >>>>>>>> Once the new version is good you switch your users t= o > > use > > > > the > > > > > > new > > > > > > > >>>>>>>> application. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> The online reshuffling effectively breaks this > > > architecture > > > > > and > > > > > > I > > > > > > > >>>>>>>> think > > > > > > > >>>>>>>> the switch in thinking here is more harmful > > > > > > > >>>>>>>> than any details about the partitioning function to > > allow > > > > > such a > > > > > > > >>>>>>>> change. > > > > > > > >>>>>>>> I > > > > > > > >>>>>>>> feel with my suggestion we are the closest to > > > > > > > >>>>>>>> the original and battle proven architecture and I ca= n > > only > > > > > warn > > > > > > to > > > > > > > >>>>>>>> move > > > > > > > >>>>>>>> away from it. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> I might have forgotten something, sometimes its hard > for > > > me > > > > to > > > > > > > >>>>>>>> getting > > > > > > > >>>>>>>> all > > > > > > > >>>>>>>> the thoughts captured in a mail, but I hope the > comments > > > > > inline > > > > > > > will > > > > > > > >>>>>>>> further make my concern clear, and put some emphasis > on > > > why > > > > I > > > > > > > >>>>>>>> prefer my > > > > > > > >>>>>>>> solution ;) > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> One thing we should all be aware of when discussing > > this, > > > > and > > > > > I > > > > > > > >>>>>>>> think > > > > > > > >>>>>>>> Dong > > > > > > > >>>>>>>> should have mentioned it (maybe he did). > > > > > > > >>>>>>>> We are not discussing all of this out of thin air bu= t > > > there > > > > is > > > > > > an > > > > > > > >>>>>>>> effort > > > > > > > >>>>>>>> in the Samza project. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> https://cwiki.apache.org/confl > uence/display/SAMZA/SEP- > > > 5%3A+ > > > > > > > >>>>>>>> Enable+partition+expansion+of+input+streams > > > > > > > >>>>>>>> https://issues.apache.org/jira/browse/SAMZA-1293 > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> To be clear. I think SEP-5 (state of last week, dont > > know > > > if > > > > > it > > > > > > > >>>>>>>> adapted > > > > > > > >>>>>>>> to > > > > > > > >>>>>>>> this discussion) is on a way better path than KIP-25= 3, > > > and I > > > > > > can't > > > > > > > >>>>>>>> really > > > > > > > >>>>>>>> explain why. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Best Jan, > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> nice weekend everyone > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> On 09.03.2018 03:36, Jun Rao wrote: > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Hi, Jan, > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Thanks for the feedback. Just some comments on the > > earlier > > > > > > points > > > > > > > >>>>>>>>> that > > > > > > > >>>>>>>>> you > > > > > > > >>>>>>>>> mentioned. > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> 50. You brought up the question of whether existing > > data > > > > > needs > > > > > > to > > > > > > > >>>>>>>>> be > > > > > > > >>>>>>>>> copied > > > > > > > >>>>>>>>> during partition expansion. My understand of your > view > > is > > > > > that > > > > > > > >>>>>>>>> avoid > > > > > > > >>>>>>>>> copying existing data will be more efficient, but i= t > > > > doesn't > > > > > > work > > > > > > > >>>>>>>>> well > > > > > > > >>>>>>>>> with > > > > > > > >>>>>>>>> compacted topics since some keys in the original > > > partitions > > > > > > will > > > > > > > >>>>>>>>> never > > > > > > > >>>>>>>>> be > > > > > > > >>>>>>>>> cleaned. It would be useful to understand your use > case > > > of > > > > > > > >>>>>>>>> compacted > > > > > > > >>>>>>>>> topics > > > > > > > >>>>>>>>> a bit more. In the common use case, the data volume > in > > a > > > > > > > compacted > > > > > > > >>>>>>>>> topic > > > > > > > >>>>>>>>> may not be large. So, I am not sure if there is a > > strong > > > > need > > > > > > to > > > > > > > >>>>>>>>> expand > > > > > > > >>>>>>>>> partitions in a compacted topic, at least initially= . > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> I do agree. State is usually smaller. Update rates > > might > > > be > > > > > > also > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> competitively high. > > > > > > > >>>>>>>> Doing Log-compaction (even beeing very efficient and > > > > > > configurable) > > > > > > > >>>>>>>> is > > > > > > > >>>>>>>> also > > > > > > > >>>>>>>> a more expensive operation than > > > > > > > >>>>>>>> just discarding old segments. Further if you want to > use > > > > more > > > > > > > >>>>>>>> consumers > > > > > > > >>>>>>>> processing the events > > > > > > > >>>>>>>> you also have to grow the number of partitions. > > Especially > > > > for > > > > > > > >>>>>>>> use-cases > > > > > > > >>>>>>>> we do (KIP-213) a tiny state full > > > > > > > >>>>>>>> table might be very expensive to process if it joins > > > > against a > > > > > > > huge > > > > > > > >>>>>>>> table. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> I can just say we have been in the spot of needing t= o > > grow > > > > log > > > > > > > >>>>>>>> compacted > > > > > > > >>>>>>>> topics. Mainly for processing power we can bring to > the > > > > table. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Further i am not at all concerned about the extra > spaced > > > > used > > > > > by > > > > > > > >>>>>>>> "garbage > > > > > > > >>>>>>>> keys". I am more concerned about the correctness of > > > innocent > > > > > > > >>>>>>>> consumers. > > > > > > > >>>>>>>> The > > > > > > > >>>>>>>> logic becomes complicated. Say for streams one would > > need > > > to > > > > > > load > > > > > > > >>>>>>>> the > > > > > > > >>>>>>>> record into state but not forward it the topology ( = to > > > have > > > > it > > > > > > > >>>>>>>> available > > > > > > > >>>>>>>> for shuffeling). I rather have it simple and a topic > > clean > > > > > > > >>>>>>>> regardless > > > > > > > >>>>>>>> if > > > > > > > >>>>>>>> it > > > > > > > >>>>>>>> still has its old partition count. Especially with > > > multiple > > > > > > > >>>>>>>> partitions > > > > > > > >>>>>>>> growth's I think it becomes insanely hard to to this > > > shuffle > > > > > > > >>>>>>>> correct. > > > > > > > >>>>>>>> Maybe > > > > > > > >>>>>>>> Streams and Samza can do it. Especially if you do > > "hipster > > > > > > stream > > > > > > > >>>>>>>> processing" > > > > > > >>>>>>>> /introducing-kafka-streams- > > > > > > > >>>>>>>> stream-processing-made-simple/>. This makes kafka wa= y > > to > > > > > > > >>>>>>>> complicated. > > > > > > > >>>>>>>> With my approach I think its way simpler because the > > topic > > > > has > > > > > > no > > > > > > > >>>>>>>> "history" > > > > > > > >>>>>>>> in terms of partitioning but is always clean. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> 51. "Growing the topic by an integer factor does not > > > require > > > > > any > > > > > > > >>>>>>>> state > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> redistribution at all." Could you clarify this a bit > > more? > > > > > Let's > > > > > > > say > > > > > > > >>>>>>>>> you > > > > > > > >>>>>>>>> have a consumer app that computes the windowed coun= t > > per > > > > key. > > > > > > If > > > > > > > >>>>>>>>> you > > > > > > > >>>>>>>>> double > > > > > > > >>>>>>>>> the number of partitions from 1 to 2 and grow the > > > consumer > > > > > > > >>>>>>>>> instances > > > > > > > >>>>>>>>> from > > > > > > > >>>>>>>>> 1 > > > > > > > >>>>>>>>> to 2, we would need to redistribute some of the > counts > > to > > > > the > > > > > > new > > > > > > > >>>>>>>>> consumer > > > > > > > >>>>>>>>> instance. Regarding to linear hashing, it's true th= at > > it > > > > > won't > > > > > > > >>>>>>>>> solve > > > > > > > >>>>>>>>> the > > > > > > > >>>>>>>>> problem with compacted topics. The main benefit is > that > > > it > > > > > > > >>>>>>>>> redistributes > > > > > > > >>>>>>>>> the keys in one partition to no more than two > > partitions, > > > > > which > > > > > > > >>>>>>>>> could > > > > > > > >>>>>>>>> help > > > > > > > >>>>>>>>> redistribute the state. > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> You don't need to spin up a new consumer in this > case. > > > > every > > > > > > > >>>>>>>>> consumer > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> would just read every partition with the (partition= % > > > > > num_task) > > > > > > > >>>>>>>> task. > > > > > > > >>>>>>>> it > > > > > > > >>>>>>>> will still have the previous data right there and ca= n > go > > > on. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> This sounds contradictory to what I said before, but > > > please > > > > > bear > > > > > > > >>>>>>>> with > > > > > > > >>>>>>>> me. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> 52. Good point on coordinating the expansion of 2 > topics > > > > that > > > > > > need > > > > > > > >>>>>>>> to > > > > > > > >>>>>>>> be > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> joined together. This is where the 2-phase partition > > > > expansion > > > > > > > could > > > > > > > >>>>>>>>> potentially help. In the first phase, we could add > new > > > > > > partitions > > > > > > > >>>>>>>>> to > > > > > > > >>>>>>>>> the 2 > > > > > > > >>>>>>>>> topics one at a time but without publishing to the > new > > > > > > patitions. > > > > > > > >>>>>>>>> Then, > > > > > > > >>>>>>>>> we > > > > > > > >>>>>>>>> can add new consumer instances to pick up the new > > > > partitions. > > > > > > In > > > > > > > >>>>>>>>> this > > > > > > > >>>>>>>>> transition phase, no reshuffling is needed since no > > data > > > is > > > > > > > coming > > > > > > > >>>>>>>>> from > > > > > > > >>>>>>>>> the > > > > > > > >>>>>>>>> new partitions. Finally, we can enable the publishi= ng > > to > > > > the > > > > > > new > > > > > > > >>>>>>>>> partitions. > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> I think its even worse than you think. I would like > to > > > > > > introduce > > > > > > > >>>>>>>>> the > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> Term > > > > > > > >>>>>>>> transitive copartitioning. Imagine > > > > > > > >>>>>>>> 2 streams application. One joins (A,B) the other (B,= C) > > > then > > > > > > there > > > > > > > >>>>>>>> is a > > > > > > > >>>>>>>> transitive copartition requirement for > > > > > > > >>>>>>>> (A,C) to be copartitioned aswell. This can spread > > > > > significantly > > > > > > > and > > > > > > > >>>>>>>> require many consumers to adapt at the same time. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> It is also not entirely clear to me how you not need > > > > > reshuffling > > > > > > > in > > > > > > > >>>>>>>> this > > > > > > > >>>>>>>> case. If A has a record that never gets updated afte= r > > the > > > > > > > expansion > > > > > > > >>>>>>>> and > > > > > > > >>>>>>>> the > > > > > > > >>>>>>>> coresponding B record moves to a new partition. How > > shall > > > > they > > > > > > > meet > > > > > > > >>>>>>>> w/o > > > > > > > >>>>>>>> shuffle? > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> 53. "Migrating consumer is a step that might be made > > > > completly > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> unnecessary > > > > > > > >>>>>>>>> if - for example streams - takes the gcd as > > partitioning > > > > > scheme > > > > > > > >>>>>>>>> instead > > > > > > > >>>>>>>>> of > > > > > > > >>>>>>>>> enforcing 1 to 1." Not sure that I fully understand > > > this. I > > > > > > think > > > > > > > >>>>>>>>> you > > > > > > > >>>>>>>>> mean > > > > > > > >>>>>>>>> that a consumer application can run more instances > than > > > the > > > > > > > number > > > > > > > >>>>>>>>> of > > > > > > > >>>>>>>>> partitions. In that case, the consumer can just > > > > > repartitioning > > > > > > > the > > > > > > > >>>>>>>>> input > > > > > > > >>>>>>>>> data according to the number of instances. This is > > > > possible, > > > > > > but > > > > > > > >>>>>>>>> just > > > > > > > >>>>>>>>> has > > > > > > > >>>>>>>>> the overhead of reshuffling the data. > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> No what I meant is ( that is also your question i > think > > > > > > Mathias) > > > > > > > >>>>>>>>> that > > > > > > > >>>>>>>>> if > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> you grow a topic by a factor. > > > > > > > >>>>>>>> Even if your processor is statefull you can can just > > > assign > > > > > all > > > > > > > the > > > > > > > >>>>>>>> multiples of the previous partition to > > > > > > > >>>>>>>> this consumer and the state to keep processing > correctly > > > > will > > > > > be > > > > > > > >>>>>>>> present > > > > > > > >>>>>>>> w/o any shuffling. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Say you have an assignment > > > > > > > >>>>>>>> Statefull consumer =3D> partition > > > > > > > >>>>>>>> 0 =3D> 0 > > > > > > > >>>>>>>> 1 =3D> 1 > > > > > > > >>>>>>>> 2 =3D> 2 > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> and you grow you topic by 4 you get, > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> 0 =3D> 0,3,6,9 > > > > > > > >>>>>>>> 1 =3D> 1,4,7,10 > > > > > > > >>>>>>>> 2 =3D> 2,5,8,11 > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Say your hashcode is 8. 8%3 =3D> 2 before so consum= er > for > > > > > > > partition 2 > > > > > > > >>>>>>>> has > > > > > > > >>>>>>>> it. > > > > > > > >>>>>>>> Now you you have 12 partitions so 8%12 =3D> 8, so it > goes > > > into > > > > > > > >>>>>>>> partition > > > > > > > >>>>>>>> 8 > > > > > > > >>>>>>>> which is assigned to the same consumer > > > > > > > >>>>>>>> who had 2 before and therefore knows the key. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Userland reshuffeling is there as an options. And it > > does > > > > > > exactly > > > > > > > >>>>>>>> what > > > > > > > >>>>>>>> I > > > > > > > >>>>>>>> suggest. And I think its the perfect strategie. All = I > am > > > > > > > suggestion > > > > > > > >>>>>>>> is > > > > > > > >>>>>>>> broker side support to switch the producers to the > newly > > > > > > > partitioned > > > > > > > >>>>>>>> topic. > > > > > > > >>>>>>>> Then the old (to few partition topic) can go away. > > > Remember > > > > > the > > > > > > > >>>>>>>> list > > > > > > > >>>>>>>> of > > > > > > > >>>>>>>> steps in the beginning of this thread. If one has > broker > > > > > support > > > > > > > for > > > > > > > >>>>>>>> all > > > > > > > >>>>>>>> where its required and streams support for those tha= t > > > aren=E2=80=99t > > > > > > > >>>>>>>> necessarily. > > > > > > > >>>>>>>> Then one has solved the problem. > > > > > > > >>>>>>>> I repeat it because I think its important. I am real= ly > > > happy > > > > > > that > > > > > > > >>>>>>>> you > > > > > > > >>>>>>>> brought that up! because its 100% what I want just > with > > > the > > > > > > > >>>>>>>> differences > > > > > > > >>>>>>>> to > > > > > > > >>>>>>>> have an option to discard the to small topic later > > (after > > > > all > > > > > > > >>>>>>>> consumers > > > > > > > >>>>>>>> adapted). And to have order correct there. I need > broker > > > > > support > > > > > > > >>>>>>>> managing > > > > > > > >>>>>>>> the copy process + the produces and fence them again= st > > > each > > > > > > > other. I > > > > > > > >>>>>>>> also > > > > > > > >>>>>>>> repeat. the copy process can run for weeks in the > worst > > > > case. > > > > > > > >>>>>>>> Copying > > > > > > > >>>>>>>> the > > > > > > > >>>>>>>> data is not the longest task migrating consumers mig= ht > > > very > > > > > well > > > > > > > be. > > > > > > > >>>>>>>> Once all consumers switched and copying is really up > to > > > date > > > > > > > (think > > > > > > > >>>>>>>> ISR > > > > > > > >>>>>>>> like up to date) only then we stop the producer, wai= t > > for > > > > the > > > > > > copy > > > > > > > >>>>>>>> to > > > > > > > >>>>>>>> finish and use the new topic for producing. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> After this the topic is perfect in shape. and no one > > needs > > > > to > > > > > > > worry > > > > > > > >>>>>>>> about > > > > > > > >>>>>>>> complicated stuff. (old keys hanging around might > arrive > > > in > > > > > some > > > > > > > >>>>>>>> other > > > > > > > >>>>>>>> topic later.....). can only imagine how many tricky > bugs > > > > gonna > > > > > > > >>>>>>>> arrive > > > > > > > >>>>>>>> after > > > > > > > >>>>>>>> someone had grown and shrunken is topic 10 times. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> 54. "The other thing I wanted to mention is that I > > believe > > > > the > > > > > > > >>>>>>>> current > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> suggestion (without copying data over) can be > > implemented > > > in > > > > > > pure > > > > > > > >>>>>>>>> userland > > > > > > > >>>>>>>>> with a custom partitioner and a small feedbackloop > from > > > > > > > >>>>>>>>> ProduceResponse > > > > > > > >>>>>>>>> =3D> > > > > > > > >>>>>>>>> Partitionier in coorporation with a change manageme= nt > > > > > system." > > > > > > I > > > > > > > am > > > > > > > >>>>>>>>> not > > > > > > > >>>>>>>>> sure a customized partitioner itself solves the > > problem. > > > We > > > > > > > >>>>>>>>> probably > > > > > > > >>>>>>>>> need > > > > > > > >>>>>>>>> some broker side support to enforce when the new > > > partitions > > > > > can > > > > > > > be > > > > > > > >>>>>>>>> used. > > > > > > > >>>>>>>>> We > > > > > > > >>>>>>>>> also need some support on the consumer/kstream side > to > > > > > preserve > > > > > > > the > > > > > > > >>>>>>>>> per > > > > > > > >>>>>>>>> key > > > > > > > >>>>>>>>> ordering and potentially migrate the processing > state. > > > This > > > > > is > > > > > > > not > > > > > > > >>>>>>>>> trivial > > > > > > > >>>>>>>>> and I am not sure if it's ideal to fully push to th= e > > > > > > application > > > > > > > >>>>>>>>> space. > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> Broker support is defenitly the preferred way here.= I > > > have > > > > > > > nothing > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> against > > > > > > > >>>>>>>> broker support. > > > > > > > >>>>>>>> I tried to say that for what I would preffer - copyi= ng > > the > > > > > data > > > > > > > >>>>>>>> over, > > > > > > > >>>>>>>> at > > > > > > > >>>>>>>> least for log compacted topics - > > > > > > > >>>>>>>> I would require more broker support than the KIP > > currently > > > > > > offers. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Jun > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> On Tue, Mar 6, 2018 at 10:33 PM, Jan Filipiak < > > > > > > > >>>>>>>>> Jan.Filipiak@trivago.com > > > > > > > >>>>>>>>> wrote: > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> Hi Dong, > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>> are you actually reading my emails, or are you just > > using > > > > the > > > > > > > >>>>>>>>> thread I > > > > > > > >>>>>>>>> > > > > > > > >>>>>>>>>> started for general announcements regarding the KI= P? > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> I tried to argue really hard against linear hashin= g. > > > > Growing > > > > > > the > > > > > > > >>>>>>>>>> topic > > > > > > > >>>>>>>>>> by > > > > > > > >>>>>>>>>> an integer factor does not require any state > > > > redistribution > > > > > at > > > > > > > >>>>>>>>>> all. I > > > > > > > >>>>>>>>>> fail > > > > > > > >>>>>>>>>> to see completely where linear hashing helps on lo= g > > > > > compacted > > > > > > > >>>>>>>>>> topics. > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> If you are not willing to explain to me what I mig= ht > > be > > > > > > > >>>>>>>>>> overlooking: > > > > > > > >>>>>>>>>> that > > > > > > > >>>>>>>>>> is fine. > > > > > > > >>>>>>>>>> But I ask you to not reply to my emails then. Plea= se > > > > > > understand > > > > > > > my > > > > > > > >>>>>>>>>> frustration with this. > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> Best Jan > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> On 06.03.2018 19:38, Dong Lin wrote: > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> Hi everyone, > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>> Thanks for all the comments! It appears that > everyone > > > > > prefers > > > > > > > >>>>>>>>>> linear > > > > > > > >>>>>>>>>> > > > > > > > >>>>>>>>>>> hashing because it reduces the amount of state th= at > > > needs > > > > > to > > > > > > be > > > > > > > >>>>>>>>>>> moved > > > > > > > >>>>>>>>>>> between consumers (for stream processing). The KI= P > > has > > > > been > > > > > > > >>>>>>>>>>> updated > > > > > > > >>>>>>>>>>> to > > > > > > > >>>>>>>>>>> use > > > > > > > >>>>>>>>>>> linear hashing. > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> Regarding the migration endeavor: it seems that > > > migrating > > > > > > > >>>>>>>>>>> producer > > > > > > > >>>>>>>>>>> library > > > > > > > >>>>>>>>>>> to use linear hashing should be pretty > > straightforward > > > > > > without > > > > > > > >>>>>>>>>>> much operational endeavor. If we don't upgrade > client > > > > > library > > > > > > > to > > > > > > > >>>>>>>>>>> use > > > > > > > >>>>>>>>>>> this > > > > > > > >>>>>>>>>>> KIP, we can not support in-order delivery after > > > partition > > > > > is > > > > > > > >>>>>>>>>>> changed > > > > > > > >>>>>>>>>>> anyway. Suppose we upgrade client library to use > this > > > > KIP, > > > > > if > > > > > > > >>>>>>>>>>> partition > > > > > > > >>>>>>>>>>> number is not changed, the key -> partition mappi= ng > > > will > > > > be > > > > > > > >>>>>>>>>>> exactly > > > > > > > >>>>>>>>>>> the > > > > > > > >>>>>>>>>>> same as it is now because it is still determined > > using > > > > > > > >>>>>>>>>>> murmur_hash(key) > > > > > > > >>>>>>>>>>> % > > > > > > > >>>>>>>>>>> original_partition_num. In other words, this chan= ge > > is > > > > > > backward > > > > > > > >>>>>>>>>>> compatible. > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> Regarding the load distribution: if we use linear > > > > hashing, > > > > > > the > > > > > > > >>>>>>>>>>> load > > > > > > > >>>>>>>>>>> may > > > > > > > >>>>>>>>>>> be > > > > > > > >>>>>>>>>>> unevenly distributed because those partitions whi= ch > > are > > > > not > > > > > > > split > > > > > > > >>>>>>>>>>> may > > > > > > > >>>>>>>>>>> receive twice as much traffic as other partitions > > that > > > > are > > > > > > > split. > > > > > > > >>>>>>>>>>> This > > > > > > > >>>>>>>>>>> issue can be mitigated by creating topic with > > > partitions > > > > > that > > > > > > > are > > > > > > > >>>>>>>>>>> several > > > > > > > >>>>>>>>>>> times the number of consumers. And there will be = no > > > > > imbalance > > > > > > > if > > > > > > > >>>>>>>>>>> the > > > > > > > >>>>>>>>>>> partition number is always doubled. So this > imbalance > > > > seems > > > > > > > >>>>>>>>>>> acceptable. > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> Regarding storing the partition strategy as > per-topic > > > > > config: > > > > > > > It > > > > > > > >>>>>>>>>>> seems > > > > > > > >>>>>>>>>>> not > > > > > > > >>>>>>>>>>> necessary since we can still use murmur_hash as t= he > > > > default > > > > > > > hash > > > > > > > >>>>>>>>>>> function > > > > > > > >>>>>>>>>>> and additionally apply the linear hashing algorit= hm > > if > > > > the > > > > > > > >>>>>>>>>>> partition > > > > > > > >>>>>>>>>>> number > > > > > > > >>>>>>>>>>> has increased. Not sure if there is any use-case > for > > > > > producer > > > > > > > to > > > > > > > >>>>>>>>>>> use a > > > > > > > >>>>>>>>>>> different hash function. Jason, can you check if > > there > > > is > > > > > > some > > > > > > > >>>>>>>>>>> use-case > > > > > > > >>>>>>>>>>> that I missed for using the per-topic partition > > > strategy? > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> Regarding how to reduce latency (due to state > > > store/load) > > > > > in > > > > > > > >>>>>>>>>>> stream > > > > > > > >>>>>>>>>>> processing consumer when partition number changes= : > I > > > need > > > > > to > > > > > > > read > > > > > > > >>>>>>>>>>> the > > > > > > > >>>>>>>>>>> Kafka > > > > > > > >>>>>>>>>>> Stream code to understand how Kafka Stream > currently > > > > > migrate > > > > > > > >>>>>>>>>>> state > > > > > > > >>>>>>>>>>> between > > > > > > > >>>>>>>>>>> consumers when the application is added/removed > for a > > > > given > > > > > > > job. > > > > > > > >>>>>>>>>>> I > > > > > > > >>>>>>>>>>> will > > > > > > > >>>>>>>>>>> reply after I finish reading the documentation an= d > > > code. > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> Thanks, > > > > > > > >>>>>>>>>>> Dong > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> On Mon, Mar 5, 2018 at 10:43 AM, Jason Gustafson = < > > > > > > > >>>>>>>>>>> jason@confluent.io> > > > > > > > >>>>>>>>>>> wrote: > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> Great discussion. I think I'm wondering whether w= e > > can > > > > > > continue > > > > > > > >>>>>>>>>>> to > > > > > > > >>>>>>>>>>> leave > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> Kafka agnostic to the partitioning strategy. The > > > > challenge > > > > > is > > > > > > > >>>>>>>>>>> > > > > > > > >>>>>>>>>>> communicating > > > > > > > >>>>>>>>>>>> the partitioning logic from producers to consume= rs > > so > > > > that > > > > > > the > > > > > > > >>>>>>>>>>>> dependencies > > > > > > > >>>>>>>>>>>> between each epoch can be determined. For the sa= ke > > of > > > > > > > >>>>>>>>>>>> discussion, > > > > > > > >>>>>>>>>>>> imagine > > > > > > > >>>>>>>>>>>> you did something like the following: > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> 1. The name (and perhaps version) of a > partitioning > > > > > strategy > > > > > > > is > > > > > > > >>>>>>>>>>>> stored > > > > > > > >>>>>>>>>>>> in > > > > > > > >>>>>>>>>>>> topic configuration when a topic is created. > > > > > > > >>>>>>>>>>>> 2. The producer looks up the partitioning strate= gy > > > > before > > > > > > > >>>>>>>>>>>> writing > > > > > > > >>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>> a > > > > > > > >>>>>>>>>>>> topic and includes it in the produce request (fo= r > > > > > fencing). > > > > > > If > > > > > > > >>>>>>>>>>>> it > > > > > > > >>>>>>>>>>>> doesn't > > > > > > > >>>>>>>>>>>> have an implementation for the configured > strategy, > > it > > > > > > fails. > > > > > > > >>>>>>>>>>>> 3. The consumer also looks up the partitioning > > > strategy > > > > > and > > > > > > > >>>>>>>>>>>> uses it > > > > > > > >>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>> determine dependencies when reading a new epoch. > It > > > > could > > > > > > > either > > > > > > > >>>>>>>>>>>> fail > > > > > > > >>>>>>>>>>>> or > > > > > > > >>>>>>>>>>>> make the most conservative dependency assumption= s > if > > > it > > > > > > > doesn't > > > > > > > >>>>>>>>>>>> know > > > > > > > >>>>>>>>>>>> how > > > > > > > >>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>> implement the partitioning strategy. For the > > consumer, > > > > the > > > > > > new > > > > > > > >>>>>>>>>>>> interface > > > > > > > >>>>>>>>>>>> might look something like this: > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> // Return the partition dependencies following a= n > > > epoch > > > > > bump > > > > > > > >>>>>>>>>>>> Map> dependencies(int > > > > > > > >>>>>>>>>>>> numPartitionsBeforeEpochBump, > > > > > > > >>>>>>>>>>>> int numPartitionsAfterEpochBump) > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> The unordered case then is just a particular > > > > > implementation > > > > > > > >>>>>>>>>>>> which > > > > > > > >>>>>>>>>>>> never > > > > > > > >>>>>>>>>>>> has > > > > > > > >>>>>>>>>>>> any epoch dependencies. To implement this, we > would > > > need > > > > > > some > > > > > > > >>>>>>>>>>>> way > > > > > > > >>>>>>>>>>>> for > > > > > > > >>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>> consumer to find out how many partitions there > were > > in > > > > > each > > > > > > > >>>>>>>>>>>> epoch, > > > > > > > >>>>>>>>>>>> but > > > > > > > >>>>>>>>>>>> maybe that's not too unreasonable. > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> Thanks, > > > > > > > >>>>>>>>>>>> Jason > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> On Mon, Mar 5, 2018 at 4:51 AM, Jan Filipiak < > > > > > > > >>>>>>>>>>>> Jan.Filipiak@trivago.com > > > > > > > >>>>>>>>>>>> wrote: > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> Hi Dong > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> thank you very much for your questions. > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> regarding the time spend copying data across: > > > > > > > >>>>>>>>>>>>> It is correct that copying data from a topic wi= th > > one > > > > > > > partition > > > > > > > >>>>>>>>>>>>> mapping > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> a topic with a different partition mapping take= s > > way > > > > > longer > > > > > > > >>>>>>>>>>>>> than > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> we > > > > > > > >>>>>>>>>>>> can > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> stop producers. Tens of minutes is a very > optimistic > > > > > > estimate > > > > > > > >>>>>>>>>>>> here. > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> Many > > > > > > > >>>>>>>>>>>>> people can not afford copy full steam and > therefore > > > > will > > > > > > have > > > > > > > >>>>>>>>>>>>> some > > > > > > > >>>>>>>>>>>>> rate > > > > > > > >>>>>>>>>>>>> limiting in place, this can bump the timespan > into > > > the > > > > > > day's. > > > > > > > >>>>>>>>>>>>> The > > > > > > > >>>>>>>>>>>>> good > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> part > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> is that the vast majority of the data can be > copied > > > > while > > > > > > the > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> producers > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> are > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> still going. One can then, piggyback the > consumers > > > > ontop > > > > > of > > > > > > > >>>>>>>>>>>>> this > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> timeframe, > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> by the method mentioned (provide them an mappin= g > > from > > > > > their > > > > > > > old > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> offsets > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> new offsets in their repartitioned topics. In > that > > > way > > > > we > > > > > > > >>>>>>>>>>>>> separate > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> migration of consumers from migration of produce= rs > > > > > > (decoupling > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> these > > > > > > > >>>>>>>>>>>>> is > > > > > > > >>>>>>>>>>>>> what kafka is strongest at). The time to actual= ly > > > swap > > > > > over > > > > > > > the > > > > > > > >>>>>>>>>>>>> producers > > > > > > > >>>>>>>>>>>>> should be kept minimal by ensuring that when a > swap > > > > > attempt > > > > > > > is > > > > > > > >>>>>>>>>>>>> started > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> consumer copying over should be very close to t= he > > log > > > > end > > > > > > and > > > > > > > >>>>>>>>>>>>> is > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> expected > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> to finish within the next fetch. The operation > > should > > > > > have > > > > > > a > > > > > > > >>>>>>>>>>>>> time-out > > > > > > > >>>>>>>>>>>>> and > > > > > > > >>>>>>>>>>>>> should be "reattemtable". > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> Importance of logcompaction: > > > > > > > >>>>>>>>>>>>> If a producer produces key A, to partiton 0, it= s > > > > forever > > > > > > > gonna > > > > > > > >>>>>>>>>>>>> be > > > > > > > >>>>>>>>>>>>> there, > > > > > > > >>>>>>>>>>>>> unless it gets deleted. The record might sit in > > there > > > > for > > > > > > > >>>>>>>>>>>>> years. A > > > > > > > >>>>>>>>>>>>> new > > > > > > > >>>>>>>>>>>>> producer started with the new partitions will > fail > > to > > > > > > delete > > > > > > > >>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>> record > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> in > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> the correct partition. Th record will be there > > > forever > > > > > and > > > > > > > one > > > > > > > >>>>>>>>>>>>> can > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> not > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> reliable bootstrap new consumers. I cannot see h= ow > > > > linear > > > > > > > >>>>>>>>>>>> hashing > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> can > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> solve > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> this. > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> Regarding your skipping of userland copying: > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> 100%, copying the data across in userland is, a= s > > far > > > > as i > > > > > > can > > > > > > > >>>>>>>>>>>>> see, > > > > > > > >>>>>>>>>>>>> only > > > > > > > >>>>>>>>>>>>> a > > > > > > > >>>>>>>>>>>>> usecase for log compacted topics. Even for > > > > logcompaction > > > > > + > > > > > > > >>>>>>>>>>>>> retentions > > > > > > > >>>>>>>>>>>>> it > > > > > > > >>>>>>>>>>>>> should only be opt-in. Why did I bring it up? I > > think > > > > log > > > > > > > >>>>>>>>>>>>> compaction > > > > > > > >>>>>>>>>>>>> is > > > > > > > >>>>>>>>>>>>> a > > > > > > > >>>>>>>>>>>>> very important feature to really embrace kafka > as a > > > > "data > > > > > > > >>>>>>>>>>>>> plattform". > > > > > > > >>>>>>>>>>>>> The > > > > > > > >>>>>>>>>>>>> point I also want to make is that copying data > this > > > way > > > > > is > > > > > > > >>>>>>>>>>>>> completely > > > > > > > >>>>>>>>>>>>> inline with the kafka architecture. it only > > consists > > > of > > > > > > > reading > > > > > > > >>>>>>>>>>>>> and > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> writing > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> to topics. > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> I hope it clarifies more why I think we should a= im > > for > > > > > more > > > > > > > than > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>> current KIP. I fear that once the KIP is done n= ot > > > much > > > > > more > > > > > > > >>>>>>>>>>>>> effort > > > > > > > >>>>>>>>>>>>> will > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> be > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> taken. > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> > > > > > > > >>>>>>>>>>>> On 04.03.2018 02:28, Dong Lin wrote: > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> Hey Jan, > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> In the current proposal, the consumer will be > > blocked > > > > on > > > > > > > >>>>>>>>>>>>> waiting > > > > > > > >>>>>>>>>>>>> for > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> other > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> consumers of the group to consume up to a give= n > > > > offset. > > > > > In > > > > > > > >>>>>>>>>>>>>> most > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> cases, > > > > > > > >>>>>>>>>>>>> all > > > > > > > >>>>>>>>>>>>> consumers should be close to the LEO of the > > > partitions > > > > > when > > > > > > > the > > > > > > > >>>>>>>>>>>>> partition > > > > > > > >>>>>>>>>>>>> expansion happens. Thus the time waiting should > not > > > be > > > > > long > > > > > > > >>>>>>>>>>>>> e.g. > > > > > > > >>>>>>>>>>>>> on > > > > > > > >>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> order of seconds. On the other hand, it may tak= e > a > > > long > > > > > > time > > > > > > > to > > > > > > > >>>>>>>>>>>>> wait > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> for > > > > > > > >>>>>>>>>>>>>> the entire partition to be copied -- the amoun= t > of > > > > time > > > > > is > > > > > > > >>>>>>>>>>>>>> proportional > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> the amount of existing data in the partition, > > which > > > > can > > > > > > take > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> tens of > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> minutes. So the amount of time that we stop > > consumers > > > > may > > > > > > not > > > > > > > >>>>>>>>>>>>> be > > > > > > > >>>>>>>>>>>>> on > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>> same order of magnitude. > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> If we can implement this suggestion without > > copying > > > > data > > > > > > > over > > > > > > > >>>>>>>>>>>>>> in > > > > > > > >>>>>>>>>>>>>> purse > > > > > > > >>>>>>>>>>>>>> userland, it will be much more valuable. Do yo= u > > have > > > > > ideas > > > > > > > on > > > > > > > >>>>>>>>>>>>>> how > > > > > > > >>>>>>>>>>>>>> this > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> can > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> be done? > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> Not sure why the current KIP not help people wh= o > > > depend > > > > > on > > > > > > > log > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> compaction. > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> Could you elaborate more on this point? > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> Thanks, > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> Dong > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> On Wed, Feb 28, 2018 at 10:55 PM, Jan > > > > > > > >>>>>>>>>>>>>> Filipiak > > > > > > >>>>>>>>>>>>>> com > > > > > > > >>>>>>>>>>>>>> wrote: > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> Hi Dong, > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> I tried to focus on what the steps are one can > > > > currently > > > > > > > >>>>>>>>>>>>>> perform > > > > > > > >>>>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> expand > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> or shrink a keyed topic while maintaining a t= op > > > notch > > > > > > > >>>>>>>>>>>>>>> semantics. > > > > > > > >>>>>>>>>>>>>>> I can understand that there might be confusio= n > > > about > > > > > > > >>>>>>>>>>>>>>> "stopping > > > > > > > >>>>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>>> consumer". It is exactly the same as proposed > in > > > the > > > > > KIP. > > > > > > > >>>>>>>>>>>>>>> there > > > > > > > >>>>>>>>>>>>>>> needs > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> be > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> a time the producers agree on the new > > partitioning. > > > > The > > > > > > > extra > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> semantics I > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> want to put in there is that we have a > > possibility > > > to > > > > > > wait > > > > > > > >>>>>>>>>>>>>>> until > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> all > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> existing data > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> is copied over into the new partitioning schem= e. > > > When > > > > I > > > > > > say > > > > > > > >>>>>>>>>>>>> stopping > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> I > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> think more of having a memory barrier that > > ensures > > > > the > > > > > > > >>>>>>>>>>>>>>> ordering. I > > > > > > > >>>>>>>>>>>>>>> am > > > > > > > >>>>>>>>>>>>>>> still > > > > > > > >>>>>>>>>>>>>>> aming for latencies on the scale of leader > > > > failovers. > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> Consumers have to explicitly adapt the new > > > > partitioning > > > > > > > >>>>>>>>>>>>>>> scheme > > > > > > > >>>>>>>>>>>>>>> in > > > > > > > >>>>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>>> above scenario. The reason is that in these > cases > > > > where > > > > > > you > > > > > > > >>>>>>>>>>>>>>> are > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> dependent > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> on a particular partitioning scheme, you also > > have > > > > > other > > > > > > > >>>>>>>>>>>>>>> topics > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> that > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> have > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> co-partition enforcements or the kind > -frequently. > > > > > > Therefore > > > > > > > >>>>>>>>>>>>>> all > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> your > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> other > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> input topics might need to grow accordingly. > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> What I was suggesting was to streamline all > these > > > > > > > operations > > > > > > > >>>>>>>>>>>>>>> as > > > > > > > >>>>>>>>>>>>>>> best > > > > > > > >>>>>>>>>>>>>>> as > > > > > > > >>>>>>>>>>>>>>> possible to have "real" partition grow and > > > shrinkage > > > > > > going > > > > > > > >>>>>>>>>>>>>>> on. > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> Migrating > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> the producers to a new partitioning scheme ca= n > be > > > > much > > > > > > more > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> streamlined > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> with proper broker support for this. Migrating > > > consumer > > > > > is > > > > > > a > > > > > > > >>>>>>>>>>>>> step > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> that > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> might be made completly unnecessary if - for > > > example > > > > > > > streams > > > > > > > >>>>>>>>>>>>>>> - > > > > > > > >>>>>>>>>>>>>>> takes > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> gcd as partitioning scheme instead of > enforcing 1 > > > to > > > > 1. > > > > > > > >>>>>>>>>>>>>>> Connect > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> consumers > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> and other consumers should be fine anyways. > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> I hope this makes more clear where I was aimin= g > > at. > > > > The > > > > > > rest > > > > > > > >>>>>>>>>>>>> needs > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> be > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> figured out. The only danger i see is that wh= en > > we > > > > are > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> introducing > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> this > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> feature as supposed in the KIP, it wont help an= y > > > people > > > > > > > >>>>>>>>>>>>> depending > > > > > > > >>>>>>>>>>>>> on > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> log > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> compaction. > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> The other thing I wanted to mention is that I > > > believe > > > > > the > > > > > > > >>>>>>>>>>>>> current > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> suggestion (without copying data over) can be > > > > implemented > > > > > > in > > > > > > > >>>>>>>>>>>>>> pure > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> userland > > > > > > > >>>>>>>>>>>>>>> with a custom partitioner and a small > > feedbackloop > > > > from > > > > > > > >>>>>>>>>>>>>>> ProduceResponse > > > > > > > >>>>>>>>>>>>>>> =3D> > > > > > > > >>>>>>>>>>>>>>> Partitionier in coorporation with a change > > > management > > > > > > > system. > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> Best Jan > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> On 28.02.2018 07:13, Dong Lin wrote: > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> Hey Jan, > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> I am not sure if it is acceptable for produce= r > to > > > be > > > > > > > stopped > > > > > > > >>>>>>>>>>>>>>> for a > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> while, > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> particularly for online application which > > requires > > > > low > > > > > > > >>>>>>>>>>>>>>>> latency. I > > > > > > > >>>>>>>>>>>>>>>> am > > > > > > > >>>>>>>>>>>>>>>> also > > > > > > > >>>>>>>>>>>>>>>> not sure how consumers can switch to a new > > topic. > > > > Does > > > > > > > user > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> application > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> needs to explicitly specify a different topi= c > > for > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> producer/consumer > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> subscribe to? It will be helpful for discussion > if > > > you > > > > > can > > > > > > > >>>>>>>>>>>>> provide > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> more > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> detail on the interface change for this > solution. > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> Thanks, > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> Dong > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> On Mon, Feb 26, 2018 at 12:48 AM, Jan > > > > > > > >>>>>>>>>>>>>>>> Filipiak > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> com > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> wrote: > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> Hi, > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> just want to throw my though in. In general th= e > > > > > > > functionality > > > > > > > >>>>>>>>>>>>>> is > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> very > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> usefull, we should though not try to find th= e > > > > > > architecture > > > > > > > >>>>>>>>>>>>>>>> to > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> hard > > > > > > > >>>>>>>>>>>>>>>>> while > > > > > > > >>>>>>>>>>>>>>>>> implementing. > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> The manual steps would be to > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> create a new topic > > > > > > > >>>>>>>>>>>>>>>>> the mirrormake from the new old topic to th= e > > new > > > > > topic > > > > > > > >>>>>>>>>>>>>>>>> wait for mirror making to catch up. > > > > > > > >>>>>>>>>>>>>>>>> then put the consumers onto the new topic > > > > > > > >>>>>>>>>>>>>>>>> (having mirrormaker spit out a > > mapping > > > > > from > > > > > > > old > > > > > > > >>>>>>>>>>>>>>>>> offsets to > > > > > > > >>>>>>>>>>>>>>>>> new > > > > > > > >>>>>>>>>>>>>>>>> offsets: > > > > > > > >>>>>>>>>>>>>>>>> if topic is increased by > factor > > X > > > > > there > > > > > > is > > > > > > > >>>>>>>>>>>>>>>>> gonna > > > > > > > >>>>>>>>>>>>>>>>> be a > > > > > > > >>>>>>>>>>>>>>>>> clean > > > > > > > >>>>>>>>>>>>>>>>> mapping from 1 offset in the old topic to X > > > offsets > > > > > in > > > > > > > the > > > > > > > >>>>>>>>>>>>>>>>> new > > > > > > > >>>>>>>>>>>>>>>>> topic, > > > > > > > >>>>>>>>>>>>>>>>> if there is no factor then > there > > > is > > > > no > > > > > > > >>>>>>>>>>>>>>>>> chance to > > > > > > > >>>>>>>>>>>>>>>>> generate a > > > > > > > >>>>>>>>>>>>>>>>> mapping that can be reasonable used for > > > continuing) > > > > > > > >>>>>>>>>>>>>>>>> make consumers stop at appropria= te > > > > points > > > > > > and > > > > > > > >>>>>>>>>>>>>>>>> continue > > > > > > > >>>>>>>>>>>>>>>>> consumption > > > > > > > >>>>>>>>>>>>>>>>> with offsets from the mapping. > > > > > > > >>>>>>>>>>>>>>>>> have the producers stop for a minimal time. > > > > > > > >>>>>>>>>>>>>>>>> wait for mirrormaker to finish > > > > > > > >>>>>>>>>>>>>>>>> let producer produce with the new metadata. > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> Instead of implementing the approach sugges= t > in > > > the > > > > > KIP > > > > > > > >>>>>>>>>>>>>>>>> which > > > > > > > >>>>>>>>>>>>>>>>> will > > > > > > > >>>>>>>>>>>>>>>>> leave > > > > > > > >>>>>>>>>>>>>>>>> log compacted topic completely crumbled and > > > > unusable. > > > > > > > >>>>>>>>>>>>>>>>> I would much rather try to build > infrastructure > > > to > > > > > > > support > > > > > > > >>>>>>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>>>>> mentioned > > > > > > > >>>>>>>>>>>>>>>>> above operations more smoothly. > > > > > > > >>>>>>>>>>>>>>>>> Especially having producers stop and use > > another > > > > > topic > > > > > > is > > > > > > > >>>>>>>>>>>>>>>>> difficult > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> and > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> it would be nice if one can trigger "invali= d > > > > > metadata" > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> exceptions > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> for > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> them > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> and > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> if one could give topics aliases so that thei= r > > > > produces > > > > > > > with > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>>>>> old > > > > > > > >>>>>>>>>>>>>>>>> topic > > > > > > > >>>>>>>>>>>>>>>>> will arrive in the new topic. > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> The downsides are obvious I guess ( having > the > > > same > > > > > > data > > > > > > > >>>>>>>>>>>>>>>>> twice > > > > > > > >>>>>>>>>>>>>>>>> for > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> transition period, but kafka tends to scale > > well > > > > with > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> datasize). > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> So > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> its a > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> nicer fit into the architecture. > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> I further want to argument that the > functionality > > > by > > > > > the > > > > > > > KIP > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> can > > > > > > > >>>>>>>>>>>>>>>>> completely be implementing in "userland" > with a > > > > > custom > > > > > > > >>>>>>>>>>>>>>>>> partitioner > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> that > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> handles the transition as needed. I would > > > > appreciate > > > > > if > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> someone > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>> could > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>> point > > > > > > > >>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>> out what a custom partitioner couldn't handle i= n > > this > > > > > case? > > > > > > > >>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> With the above approach, shrinking a topic > > becomes > > > > the > > > > > > same > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> steps. > > > > > > > >>>>>>>>>>>>>>>>> Without > > > > > > > >>>>>>>>>>>>>>>>> loosing keys in the discontinued partitions= . > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> Would love to hear what everyone thinks. > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> Best Jan > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> On 11.02.2018 00:35, Dong Lin wrote: > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> Hi all, > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> I have created KIP-253: Support in-order > > message > > > > > > delivery > > > > > > > >>>>>>>>>>>>>>>>> with > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> partition > > > > > > > >>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> expansion. See > > > > > > > >>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confl > > > > > > > >>>>>>>>>>>>>>>>>> uence/display/KAFKA/KIP-253% > > > > > > > >>>>>>>>>>>>>>>>>> 3A+Support+in-order+message+de > > > > > > > >>>>>>>>>>>>>>>>>> livery+with+partition+expansio > > > > > > > >>>>>>>>>>>>>>>>>> n > > > > > > > >>>>>>>>>>>>>>>>>> . > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> This KIP provides a way to allow messages = of > > the > > > > > same > > > > > > > key > > > > > > > >>>>>>>>>>>>>>>>>> from > > > > > > > >>>>>>>>>>>>>>>>>> the > > > > > > > >>>>>>>>>>>>>>>>>> same > > > > > > > >>>>>>>>>>>>>>>>>> producer to be consumed in the same order > they > > > are > > > > > > > >>>>>>>>>>>>>>>>>> produced > > > > > > > >>>>>>>>>>>>>>>>>> even > > > > > > > >>>>>>>>>>>>>>>>>> if > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> we > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> expand partition of the topic. > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>> Thanks, > > > > > > > >>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> Dong > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>>>>>>>>>>>>>>>> > > > > > > > >>>> > > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > -- Guozhang > --0000000000002b61730568f3827b--