Date: Wed, 18 May 2016 22:29:59 -0700
Subject: Re: [DISCUSS] KIP-58 - Make Log Compaction Point Configurable
From: Gwen Shapira
To: dev@kafka.apache.org

Oops :)

The docs are definitely not doing the feature any favors, but I didn't mean to imply the feature is thoughtless.

Here's the thing I'm not getting: you are trading off disk space for IO efficiency. That's reasonable. But why not allow users to specify space in bytes?

Basically tell the LogCompacter: once I have X bytes of dirty data (or, post KIP-58, X bytes of data that needs cleaning), please compact it to the best of your ability (which in steady state will be into almost nothing).

Since we know how big the compaction buffer is and how Kafka uses it, we can exactly calculate how much space we are wasting vs. how much IO we are going to do per unit of time. The size of a single segment or the compaction buffer (whichever is bigger) would be a good default value for min.dirty.bytes. We can even evaluate and re-evaluate it based on the amount of free space on the disk. Heck, we could automate those tunings (lower min.dirty.bytes to trigger compaction and free space if we are close to running out of space).

We can do the same capacity planning with percentages, but it requires more information to know the results - information that can only be acquired after you reach steady state.

It is a bit obvious, so I'm guessing the idea was considered and dismissed. I just can't see why. If only there were KIPs back then, so I could look at rejected alternatives...
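For concreteness, here is a rough back-of-the-envelope sketch of the space-vs-IO math under a hypothetical min.dirty.bytes setting. All names and numbers below are illustrative assumptions, not real broker defaults:

    // Illustrative capacity math for a hypothetical min.dirty.bytes setting.
    // None of these values are actual Kafka defaults.
    object DirtyBytesPlanning extends App {
      val segmentBytes       = 1L * 1024 * 1024 * 1024   // 1 GB log segment
      val cleanerBufferBytes = 128L * 1024 * 1024         // dedupe buffer, say 128 MB
      val minDirtyBytes      = math.max(segmentBytes, cleanerBufferBytes)

      val updateBytesPerSec  = 5L * 1024 * 1024           // steady-state write rate to the topic

      // Wasted space is bounded (roughly) by the threshold itself...
      val maxWasteGB = minDirtyBytes.toDouble / (1024 * 1024 * 1024)
      // ...and a cleaning pass triggers about once per minDirtyBytes of new dirty data.
      val minutesBetweenCleanings = minDirtyBytes.toDouble / updateBytesPerSec / 60

      println(f"waste bound ~$maxWasteGB%.1f GB, cleaning roughly every $minutesBetweenCleanings%.0f minutes")
    }

With a percentage-based threshold, the same numbers depend on the total steady-state log size, which is exactly the information you don't have up front.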
Gwen

On Wed, May 18, 2016 at 9:54 PM, Jay Kreps wrote:
> So in summary we never considered this a mechanism to give the consumer time to consume prior to compaction, just a mechanism to control space wastage. It sort of accidentally gives you that, but it's super hard to reason about it as an SLA since it is relative to the log size rather than absolute.
>
> -Jay
>
> On Wed, May 18, 2016 at 9:50 PM, Jay Kreps wrote:
>
> > The sad part is I actually did think pretty hard about how to configure that stuff, so I guess *I* think the config makes sense! Clearly trying to prevent my being shot :-)
> >
> > I agree the name could be improved and the documentation is quite spartan--no guidance at all on how to set it or what it trades off. A bit shameful.
> >
> > The thinking was this. One approach to cleaning would be to just do it continually, with the idea that, hey, you can't take that I/O with you--once you've budgeted N MB/sec of background I/O for compaction some of the time, you might as well just use that budget all the time. But this leads to seemingly silly behavior where you are doing big ass compactions all the time to free up just a few bytes, and we thought it would freak people out. Plus arguably Kafka usage isn't all in steady state, so this wastage would come out of the budget for other bursty stuff.
> >
> > So when should compaction kick in? Well, what are you trading off? The tradeoff here is how much space to waste on disk versus how much I/O to use in cleaning. In general we can't say exactly how much space a compaction will free up--during a phase of all "inserts" compaction may free up no space at all. You just have to do the compaction and hope for the best. But in general most compacted topics should soon reach a "steady state" where they aren't growing or are growing very slowly, so most writes are updates (if they keep growing rapidly indefinitely then you are going to run out of space--so it is safe to assume they do reach steady state). In this steady state the ratio of uncompacted log to total log is effectively the utilization (wasted space percentage). So if you set it to 50%, your data is about half duplicates. By tolerating more uncleaned log you get more bang for your compaction I/O buck, but more space wastage. This seemed like a reasonable way to think about it because maybe you know your compacted data size (roughly), so you can reason about whether using, say, twice that space is okay.
> >
> > Maybe we should just change the name to something about target utilization, even though that isn't strictly true except in steady state?
> >
> > -Jay
> >
> > On Wed, May 18, 2016 at 7:59 PM, Gwen Shapira wrote:
> >
> >> Interesting!
> >>
> >> This needs to be double checked by someone with more experience, but reading the code, it looks like "log.cleaner.min.cleanable.ratio" controls *just* the second property, and I'm not even convinced about that.
> >>
> >> A few facts:
> >>
> >> 1. Each cleaner thread cleans one log at a time. It always goes for the log with the largest percentage of non-compacted bytes. If you just created a new partition, wrote 1G and switched to a new segment, it is very likely that this will be the next log to compact - explaining the behavior Eric and Jay complained about. I expected it to be rare.
> >>
> >> 2. If the dirtiest log has less than 50% dirty bytes (or whatever min.cleanable is), it will be skipped, knowing that the others have an even lower dirty ratio.
> >>
> >> 3. If we do decide to clean a log, we will clean the whole damn thing, leaving only the active segment. Contrary to my expectations, it does not leave any dirty byte behind. So *at most* you will have a single clean segment. Again, explaining why Jay, James and Eric are unhappy.
> >>
> >> 4. What it does guarantee (kinda? at least I think it tries?) is to always clean a large "chunk" of data at once, hopefully minimizing churn (cleaning small bits off the same log over and over) and minimizing IO. It does have the nice mathematical property of guaranteeing double the amount of time between cleanings (except it doesn't really, because who knows the size of the compacted region).
> >>
> >> 5. Whoever wrote the docs should be shot :)
> >>
> >> So, in conclusion:
> >> In my mind, min.cleanable.dirty.ratio is terrible: it is misleading, difficult to understand, and IMO doesn't even do what it should do.
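A rough sketch of the selection behavior described in points 1-3 above, with illustrative names (this is a reading of the behavior, not the actual kafka.log.LogCleaner code):

    // Sketch only: how one cleaning round picks its log, as described above.
    case class LogState(name: String, dirtyBytes: Long, cleanBytes: Long) {
      def dirtyRatio: Double = dirtyBytes.toDouble / (dirtyBytes + cleanBytes)
    }

    // Pick at most one log per round: the one with the highest dirty ratio,
    // and only if it crosses min.cleanable.ratio. It is then cleaned in full,
    // leaving only the active segment untouched.
    def pickLogToClean(logs: Seq[LogState], minCleanableRatio: Double): Option[LogState] =
      logs.sortBy(l => -l.dirtyRatio)
          .headOption
          .filter(_.dirtyRatio >= minCleanableRatio)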
> >> I would like to consider the possibility of min.cleanable.dirty.bytes, which should give good control over the number of IO operations (since the size of the compaction buffer is known).
> >>
> >> In the context of this KIP, the interaction with cleanable ratio and cleanable bytes will be similar, and it looks like it was already done correctly in the PR, so no worries ("the ratio's definition will be expanded to become the ratio of 'compactable' to compactable plus compacted message sizes, where compactable includes log segments that are neither the active segment nor those prohibited from being compacted because they contain messages that do not satisfy all the new lag constraints").
> >>
> >> I may open a new KIP to handle the cleanable ratio. Please don't let my confusion detract from this KIP.
> >>
> >> Gwen
> >>
> >> On Wed, May 18, 2016 at 3:41 PM, Ben Stopford wrote:
> >> > Generally, this seems like a sensible proposal to me.
> >> >
> >> > Regarding (1): time and message count seem sensible. I can't think of a specific use case for bytes, but it seems like there could be one.
> >> >
> >> > Regarding (2):
> >> > The setting log.cleaner.min.cleanable.ratio currently seems to have two uses. It controls which messages will not be compacted, but it also provides a fractional bound on how many logs are cleaned (and hence work done) in each round. This new proposal seems aimed at the first use, but not the second.
> >> >
> >> > The second case better suits a fractional setting like the one we have now. Using a fractional value means the amount of data cleaned scales in proportion to the data stored in the log. If we were to replace this with an absolute value, it would create proportionally more cleaning work as the log grew in size.
> >> >
> >> > So, if I understand this correctly, I think there is an argument for having both.
> >> >
> >> >> On 17 May 2016, at 19:43, Gwen Shapira wrote:
> >> >>
> >> >> .... and Spark's implementation is another good reason to allow compaction lag.
> >> >>
> >> >> I'm convinced :)
> >> >>
> >> >> We need to decide:
> >> >>
> >> >> 1) Do we need just the .ms config, or anything else? Consumer lag is measured (and monitored) in messages, so if we need this feature to somehow work in tandem with consumer lag monitoring, I think we need .messages too.
> >> >>
> >> >> 2) Does this new configuration allow us to get rid of the cleaner.ratio config?
> >> >>
> >> >> Gwen
> >> >>
> >> >> On Tue, May 17, 2016 at 9:43 AM, Eric Wasserman wrote:
> >> >>> James,
> >> >>>
> >> >>> Your pictures do an excellent job of illustrating my point.
> >> >>>
> >> >>> My mention of the additional "10's of minutes to hours" refers to how far after the original target checkpoint (T1 in your diagram) one may need to go to get to a checkpoint where all partitions of all topics are in the uncompacted region of their respective logs. In terms of your diagram: the T3 transaction could have been written 10's of minutes to hours after T1, as that was how much time it took all readers to get to T1.
> >> >>>
> >> >>>> You would not have to start over from the beginning in order to read to T3.
> >> >>>
> >> >>> While I agree this is technically true, in practice it could be very onerous to actually do it.
> >> >>> For example, we use the Kafka consumer that is part of the Spark Streaming library to read table topics. It accepts a range of offsets to read for each partition. Say we originally target ranges from offset 0 to the offset of T1 for each topic+partition. There really is no way to have the library arrive at T1 and then "keep going" to T3. What is worse, given Spark's design, if you lost a worker during your calculations you would be in a rather sticky position. Spark achieves resiliency not by data redundancy but by keeping track of how to reproduce the transformations leading to a state. In the face of a lost worker, Spark would try to re-read that portion of the data on the lost worker from Kafka. However, in the interim compaction may have moved past the reproducible checkpoint (T3), rendering the data inconsistent. At best the entire calculation would need to start over targeting some later transaction checkpoint.
> >> >>>
> >> >>> Needless to say, with the proposed feature everything is quite simple. As long as we set the compaction lag large enough, we can be assured that T1 will remain in the uncompacted region and thereby be reproducible. Thus reading from 0 to the offsets in T1 will be sufficient for the duration of the calculation.
> >> >>>
> >> >>> Eric
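For reference, a minimal sketch of the fixed-offset-range read described above, assuming the direct createRDD API from the spark-streaming-kafka connector; the topic name, broker address and offsets are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Read each table-topic partition from offset 0 up to the offset recorded
    // for checkpoint T1. The ranges are fixed up front, so there is no way to
    // "keep going" to a later checkpoint without building a new RDD.
    val sc = new SparkContext(new SparkConf().setAppName("table-topic-load"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // placeholder broker
    val ranges = Array(
      OffsetRange("table-topic", 0, 0L, 123456L), // partition 0: [0, offset at T1)
      OffsetRange("table-topic", 1, 0L, 118230L)  // partition 1: [0, offset at T1)
    )
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)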