From: Stanislav Kozlovski
Date: Wed, 15 Aug 2018 17:40:57 +0100
Subject: Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure
To: dev@kafka.apache.org

Hi Jason,

I was thinking about your suggestion. I agree that it makes sense to cap it at a certain threshold, and it doesn't sound *too* restrictive to me either, considering the common case. The __consumer_offsets issue is a real problem, that is true.

Nevertheless, I have some concerns about a threshold on `uncleanable.bytes`. There is now a chance that a single error in a big partition (other than __consumer_offsets) marks the directory offline outright. To avoid this, we would need to set the threshold to *at least* half of the biggest compacted partition's size, since the default of `log.cleaner.min.cleanable.ratio` is 0.5. Even then, that single partition will quickly go over the threshold, since it is not cleaned at all.

Ideally, we'd want to confirm that more partitions are failing before marking the disk offline, to better ensure it is an actual disk problem. A size threshold makes this tricky. Picking a reasonable default also seems very hard: it would either be too small for some users (marking the directory offline too eagerly) or too big for others (never marking it offline at all).

Perhaps the best approach would be to have the functionality disabled by default. In fact, since configs are relatively easy to add but hard to take away, I believe it is best to drop that functionality from this KIP and consider adding it later if the community believes it is needed. That seems like a reasonable approach, since the main perceived benefit of this KIP is the isolation of partition failures and, to some extent, the new metrics.

What are other people's thoughts on this? I have updated the KIP accordingly.

Best,
Stanislav
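For context, a minimal sketch of the kind of per-log-directory `uncleanable.bytes` threshold being debated above is shown below. This is an illustration only, not code from the KIP or from Kafka; `UncleanableTracker`, `recordFailure`, and `maxUncleanableBytes` are made-up names. It shows why a single large compacted partition is a problem for such a threshold: once its size exceeds the limit, its first cleaning failure would mark the whole directory offline.

    // Illustrative sketch (Scala), not part of KIP-346 or of Kafka itself.
    import scala.collection.mutable

    class UncleanableTracker(maxUncleanableBytes: Long) {
      // log directory -> total bytes of partitions the cleaner has given up on
      private val uncleanableBytesPerDir = mutable.Map.empty[String, Long]

      def recordFailure(logDir: String, partitionSizeBytes: Long): Unit = synchronized {
        uncleanableBytesPerDir(logDir) =
          uncleanableBytesPerDir.getOrElse(logDir, 0L) + partitionSizeBytes
      }

      // A partition bigger than the threshold trips this check on its first
      // failure, which is the concern raised in the message above.
      def shouldMarkOffline(logDir: String): Boolean = synchronized {
        uncleanableBytesPerDir.getOrElse(logDir, 0L) >= maxUncleanableBytes
      }
    }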
On Wed, Aug 15, 2018 at 12:27 AM Jason Gustafson wrote:

> Sorry for the noise. Let me try again:

>> My initial suggestion was to *track* the uncleanable disk space. I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful. I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB) as log directories might have different sizes. Maybe a percentage would be better, then (e.g. 30% of the whole log dir size), WDYT?

> The two most common problems I am aware of when the log cleaner crashes are 1) running out of disk space and 2) excessive coordinator loading time. The problem in the latter case is that when the log cleaner is not running, the __consumer_offsets topics can become huge. If there is a failure which causes a coordinator change, then it can take a long time for the new coordinator to load the offset cache since it reads from the beginning. Consumers are effectively dead in the water when this happens since they cannot commit offsets. We've seen coordinator loading times in the hours for some users. If we could set a total cap on the uncleanable size, then we can reduce the impact from unbounded __consumer_offsets growth.

> Also, it's true that log directories may have different sizes, but I'm not sure that is a common case. I don't think it would be too restrictive to use a single max size for all directories. I think the key is just having some way to cap the size of the uncleaned data.

>> I feel it still makes sense to have a metric tracking how many uncleanable partitions there are and the total amount of uncleanable disk space (per log dir, via a JMX tag). But now, rather than fail the log directory after a certain count of uncleanable partitions, we could fail it after a certain percentage (or size) of its storage is uncleanable.

> Yes, having the metric for uncleanable partitions could be useful. I was mostly concerned about the corresponding config since it didn't seem to address the main problems with the cleaner dying.

> Thanks,
> Jason

> On Tue, Aug 14, 2018 at 4:11 PM, Jason Gustafson wrote:

>> Hey Stanislav, responses below:

>>> My initial suggestion was to *track* the uncleanable disk space. I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful. I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB) as log directories might have different sizes. Maybe a percentage would be better, then (e.g. 30% of the whole log dir size), WDYT?

>> On Fri, Aug 10, 2018 at 2:05 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:

>>> Hey Jason,

>>> My initial suggestion was to *track* the uncleanable disk space. I can see why marking a log directory as offline after a certain threshold of uncleanable disk space is more useful. I'm not sure if we can set that threshold to be of a certain size (e.g. 100GB) as log directories might have different sizes. Maybe a percentage would be better, then (e.g. 30% of the whole log dir size), WDYT?

>>> I feel it still makes sense to have a metric tracking how many uncleanable partitions there are and the total amount of uncleanable disk space (per log dir, via a JMX tag). But now, rather than fail the log directory after a certain count of uncleanable partitions, we could fail it after a certain percentage (or size) of its storage is uncleanable.

>>> I'd like to hear other people's thoughts on this. Sound good?

>>> Best,
>>> Stanislav

>>> On Fri, Aug 10, 2018 at 12:40 AM Jason Gustafson wrote:

>>>> Hey Stanislav,

>>>> Sorry, I was probably looking at an older version (I had the tab open for so long!).

>>>> I have been thinking about `max.uncleanable.partitions` and wondering if it's what we really want. The main risk if the cleaner cannot clean a partition is eventually running out of disk space. This is the most common problem we have seen with cleaner failures, and it can happen even if there is just one uncleanable partition. We've actually seen cases in which a single __consumer_offsets partition grew large enough to fill a significant portion of the disk.
>>>> The difficulty with allowing a system to run out of disk space before failing is that it makes recovery difficult and time consuming. Clean shutdown, for example, requires writing some state to disk. Without clean shutdown, it can take the broker significantly longer to start up because it has to do more segment recovery.

>>>> For this problem, `max.uncleanable.partitions` does not really help. You can set it to 1 and fail fast, but that is not much better than the existing state. You had a suggestion previously in the KIP to use the size of uncleanable disk space instead. What was the reason for rejecting that? Intuitively, it seems like a better fit for a cleaner failure. It would provide users some time to react to failures while still protecting them from exhausting the disk.

>>>> Thanks,
>>>> Jason

>>>> On Thu, Aug 9, 2018 at 9:46 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:

>>>>> Hey Jason,

>>>>> 1. *10* is the default value; it says so in the KIP.
>>>>> 2. This is a good catch. As the current implementation stands, it's not a useful metric since the thread continues to run even if all log directories are offline (although I'm not sure what the broker's behavior is in that scenario). I'll make sure the thread stops if all log directories are offline.

>>>>> I don't know which "Needs Discussion" item you're referencing; there hasn't been any in the KIP since August 1, and that was for the metric only. KIP History <...ons.action?pageId=89064875>

>>>>> I've updated the KIP to mention the "time-since-last-run" metric.

>>>>> Thanks,
>>>>> Stanislav

>>>>> On Wed, Aug 8, 2018 at 12:12 AM Jason Gustafson wrote:

>>>>>> Hi Stanislav,

>>>>>> Just a couple quick questions:

>>>>>> 1. I may have missed it, but what will be the default value for `max.uncleanable.partitions`?
>>>>>> 2. It seems there will be some impact for users that monitor "time-since-last-run-ms" in order to detect cleaner failures. Not sure it's a major concern, but probably worth mentioning in the compatibility section. Also, is this still a useful metric after this KIP?

>>>>>> Also, maybe the "Needs Discussion" item can be moved to rejected alternatives since you've moved to a vote? I think leaving this for potential future work is reasonable.

>>>>>> Thanks,
>>>>>> Jason

>>>>>> On Mon, Aug 6, 2018 at 12:29 PM, Ray Chiang wrote:

>>>>>>> I'm okay with that.

>>>>>>> -Ray

>>>>>>> On 8/6/18 10:59 AM, Colin McCabe wrote:

>>>>>>>> Perhaps we could start with max.uncleanable.partitions and then implement max.uncleanable.partitions.per.logdir in a follow-up change if it seemed to be necessary? What do you think?

>>>>>>>> regards,
>>>>>>>> Colin
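As an aside, here is roughly how the two knobs being debated could be declared with the broker's ConfigDef machinery. The names, the default of 10, and the doc strings simply mirror the discussion above; none of this is the final KIP-346 definition.

    // Sketch only: candidate declarations for the configs debated in this thread.
    import org.apache.kafka.common.config.ConfigDef
    import org.apache.kafka.common.config.ConfigDef.{Importance, Type}

    object CleanerConfigSketch {
      // The default of 10 comes from the thread above; both names are still open.
      val configDef: ConfigDef = new ConfigDef()
        .define("max.uncleanable.partitions", Type.INT, 10, Importance.MEDIUM,
          "How many partitions may be marked uncleanable before further action is taken.")
        .define("max.uncleanable.partitions.per.logdir", Type.INT, 10, Importance.MEDIUM,
          "Per-log-directory variant discussed as a possible follow-up change.")
    }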
>>>>>>>> On Sat, Aug 4, 2018, at 10:53, Stanislav Kozlovski wrote:

>>>>>>>>> Hey Ray,

>>>>>>>>> Thanks for the explanation. In regards to the configuration property, I'm not sure. As long as it has sufficient documentation, I find "max.uncleanable.partitions" to be okay. If we were to add the distinction explicitly, maybe it should be `max.uncleanable.partitions.per.logdir`?

>>>>>>>>> On Thu, Aug 2, 2018 at 7:32 PM Ray Chiang wrote:

>>>>>>>>>> One more thing occurred to me. Should the configuration property be named "max.uncleanable.partitions.per.disk" instead?

>>>>>>>>>> -Ray

>>>>>>>>>> On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:

>>>>>>>>>>> Yes, good catch. Thank you, James!

>>>>>>>>>>> Best,
>>>>>>>>>>> Stanislav

>>>>>>>>>>> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushujames@gmail.com> wrote:

>>>>>>>>>>>> Can you update the KIP to say what the default is for max.uncleanable.partitions?

>>>>>>>>>>>> -James

>>>>>>>>>>>> Sent from my iPhone

>>>>>>>>>>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:

>>>>>>>>>>>>> Hey group,

>>>>>>>>>>>>> I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.

>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Stanislav

>>>>>>>>>>>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanislav@confluent.io> wrote:

>>>>>>>>>>>>>> Hey, Ray

>>>>>>>>>>>>>> Thanks for pointing that out, it's fixed now.

>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Stanislav

>>>>>>>>>>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:

>>>>>>>>>>>>>>> Thanks. Can you fix the link in the "KIPs under discussion" table on the main KIP landing page <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>?

>>>>>>>>>>>>>>> I tried, but the Wiki won't let me.

>>>>>>>>>>>>>>> -Ray

>>>>>>>>>>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:

>>>>>>>>>>>>>>>> Hey guys,

>>>>>>>>>>>>>>>> @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.

>>>>>>>>>>>>>>>> *Disk Failure* - I tend to agree with what Colin said: once a disk fails, you don't want to work with it again.
>>>>>>>>>>>>>>>> As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.

>>>>>>>>>>>>>>>> *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.

>>>>>>>>>>>>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect the logs.

>>>>>>>>>>>>>>>> Hey Ray,

>>>>>>>>>>>>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.

>>>>>>>>>>>>>>>> I think we can do this for every exception that isn't an `IOException`. This will future-proof us against bugs in the system and potential other errors. Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?

>>>>>>>>>>>>>>>>> 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a .skip file (or something similar).

>>>>>>>>>>>>>>>> This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
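For readers skimming the archive, the control flow described above (skip a partition whose cleaning throws anything other than an IOException, keep cleaning the rest) might look roughly like the sketch below. It is an illustration only, not the actual KIP-346 patch; `cleanLog`, `markUncleanable`, and `markLogDirOffline` are hypothetical hooks.

    // Illustrative sketch of "skip and continue" cleaning, not real LogCleaner code.
    import java.io.IOException

    def cleanOnePass(partitions: Seq[String],
                     cleanLog: String => Unit,          // attempt compaction of one partition
                     markUncleanable: String => Unit,   // remember it, surface it in metrics
                     markLogDirOffline: String => Unit  // existing behaviour for disk errors
                    ): Unit = {
      partitions.foreach { tp =>
        try cleanLog(tp)
        catch {
          case _: IOException => markLogDirOffline(tp) // disk-level problem: fail the log dir
          case _: Exception   => markUncleanable(tp)   // logic bug or bad data: skip this one
        }
      }
    }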
>>>>>>>>>>>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io> wrote:

>>>>>>>>>>>>>>>>> For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again, which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.

>>>>>>>>>>>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com> wrote:

>>>>>>>>>>>>>>>>>> << ... by putting you in an infinite loop which consumes resources and fires off continuous log messages.

>>>>>>>>>>>>>>>>>> Hi Colin. In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur, then wait for 1 minute before doing it; then if a third respawn is to occur, wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc., up to some max wait time).

>>>>>>>>>>>>>>>>>> I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.

>>>>>>>>>>>>>>>>>> Ron
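The increasing backoff Ron describes is easy to picture; a minimal sketch follows, with hypothetical names and a made-up 30-minute cap. Each respawn waits twice as long as the previous one, which bounds the resource churn without taking a position on whether respawning is the right call.

    // Minimal sketch of the increasing-backoff idea; not real Kafka code.
    import java.util.concurrent.TimeUnit

    class RespawnBackoff(initialWaitMs: Long = TimeUnit.MINUTES.toMillis(1),
                         maxWaitMs: Long = TimeUnit.MINUTES.toMillis(30)) {
      private var nextWaitMs = initialWaitMs

      // Block before the next respawn, then double the wait: 1, 2, 4, 8 minutes, ...
      def awaitNextRespawn(): Unit = {
        Thread.sleep(nextWaitMs)
        nextWaitMs = math.min(nextWaitMs * 2, maxWaitMs)
      }
    }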
>>>>>>>>>>>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org> wrote:

>>>>>>>>>>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:

>>>>>>>>>>>>>>>>>>>> Hi Stanislav! Thanks for this KIP!

>>>>>>>>>>>>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.

>>>>>>>>>>>>>>>>>>>> Things are better now than they used to be. We have the metric kafka.log:type=LogCleanerManager,name=time-since-last-run-ms which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread without requiring a broker restart: https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration I've only read about this, I haven't personally tried it.

>>>>>>>>>>>>>>>>>>> Thanks for pointing this out, James! Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP. Maybe in the intro section?

>>>>>>>>>>>>>>>>>>> I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution. But it's good to know that it's a possibility on some older releases.

>>>>>>>>>>>>>>>>>>>> Some comments:
>>>>>>>>>>>>>>>>>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
>>>>>>>>>>>>>>>>>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created if the thread ever dies.
>>>>>>>>>>>>>>>>>>> This is inconsistent with the way the rest of Kafka works. We don't automatically re-create other threads in the broker if they terminate. In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.

>>>>>>>>>>>>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable, but after a leader election happened the log truncated and it was then cleanable again. I'm not sure.

>>>>>>>>>>>>>>>>>>> James, I disagree. We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.

>>>>>>>>>>>>>>>>>>> What would happen is disks would just go bad over time. The DataNode would notice this and take them offline. But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system. Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.

>>>>>>>>>>>>>>>>>>> The way the disk failed was normally having an I/O request take a really long time and time out.
>>>>>>>>>>>>>>>>>>> So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl. It was even worse in the second scenario: if the disk appeared to work for a while but then failed, any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.

>>>>>>>>>>>>>>>>>>> Disks aren't biological systems -- they don't heal over time. Once they're bad, they stay bad. The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.

>>>>>>>>>>>>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as: kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count value=4

>>>>>>>>>>>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems.
>>>>>>>>>>>>>>>>>>>> If we *did* want the topic names to be accessible, what do you think of having kafka.log:type=LogCleanerManager,topic=topic1,partition=2? I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".

>>>>>>>>>>>>>>>>>>> I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric? It seems like the admin should notice that uncleanable partitions are present, and then check the logs?

>>>>>>>>>>>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.

>>>>>>>>>>>>>>>>>>> I don't think this is a meaningful comparison. In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster. If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug."
>>>>>>>>>>>>>>>>>>> This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.

>>>>>>>>>>>>>>>>>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact. That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException. In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue. We also should increase test coverage for the log cleaner.

>>>>>>>>>>>>>>>>>>>> * About marking disks as offline when exceeding a certain threshold: that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.

>>>>>>>>>>>>>>>>>>> In general, when we encounter I/O errors, we take the disk partition offline. This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):

>>>>>>>>>>>>>>>>>>> - Broker assumes a log directory to be good after it starts, and marks a log directory as bad once there is an IOException when the broker attempts to access (i.e. read or write) the log directory.
>>>>>>>>>>>>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>>>>>>>>>>>>> - Broker will stop serving replicas in any bad log directory. New ...

>>>>> --
>>>>> Best,
>>>>> Stanislav

>>> --
>>> Best,
>>> Stanislav

--
Best,
Stanislav