kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stanislav Kozlovski <stanis...@confluent.io>
Subject Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure
Date Wed, 01 Aug 2018 09:01:11 GMT
Hey Ray,

* Yes, we'd have the logDir as a tag in the metric
* The intention is to have Int.MaxValue as the maximum uncleanable
partitions count
* My idea is to store the marked logs (actually partitions) in memory
instead of the ".skip" files to keep the change simple. I have also decided
to omit any retries from the implementation - once a partition is marked as
"uncleanable" it stays so until a broker restart

Please do let me know if you are okay with this description. I should have
the code available for review soon

Thanks,
Stanislav

On Tue, Jul 31, 2018 at 6:30 PM Ray Chiang <rchiang@apache.org> wrote:

> I had one question that I was trying to do some investigation before I
> asked, but I'm having some issues with my JMX browser right now.
>
>   * For the uncleanable-partitions-count metric, is that going to be
>     per-logDir entry?
>   * For max.uncleanable.partitions, is the intention to have -1 be
>     "infinite" or are we going to use Int.MaxValue as a practical
>     equivalent?
>   * In this sentence: "When evaluating which logs to compact, skip the
>     marked ones.", should we define what "marking" will be?  If we're
>     going with the ".skip" file or equivalent, can we also add how
>     successful retries will behave?
>
> -Ray
>
> On 7/31/18 9:56 AM, Stanislav Kozlovski wrote:
> > Hey group,
> >
> > I am planning on starting a voting thread tomorrow. Please do reply if
> you
> > feel there is anything left to discuss.
> >
> > Best,
> > Stanislav
> >
> > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> stanislav@confluent.io>
> > wrote:
> >
> >> Hey, Ray
> >>
> >> Thanks for pointing that out, it's fixed now
> >>
> >> Best,
> >> Stanislav
> >>
> >> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:
> >>
> >>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
> >>> the main KIP landing page
> >>> <
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
> >?
> >>>
> >>> I tried, but the Wiki won't let me.
> >>>
> >>> -Ray
> >>>
> >>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>> Hey guys,
> >>>>
> >>>> @Colin - good point. I added some sentences mentioning recent
> >>> improvements
> >>>> in the introductory section.
> >>>>
> >>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
> >>> fails,
> >>>> you don't want to work with it again. As such, I've changed my mind
> and
> >>>> believe that we should mark the LogDir (assume its a disk) as offline
> on
> >>>> the first `IOException` encountered. This is the LogCleaner's current
> >>>> behavior. We shouldn't change that.
> >>>>
> >>>> *Respawning Threads* - I believe we should never re-spawn a thread.
> The
> >>>> correct approach in my mind is to either have it stay dead or never
> let
> >>> it
> >>>> die in the first place.
> >>>>
> >>>> *Uncleanable-partition-names metric* - Colin is right, this metric is
> >>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
> >>> and
> >>>> inspect logs.
> >>>>
> >>>>
> >>>> Hey Ray,
> >>>>
> >>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner
to
> >>>>> skip over problematic partitions instead of dying.
> >>>> I think we can do this for every exception that isn't `IOException`.
> >>> This
> >>>> will future-proof us against bugs in the system and potential other
> >>> errors.
> >>>> Protecting yourself against unexpected failures is always a good thing
> >>> in
> >>>> my mind, but I also think that protecting yourself against bugs in the
> >>>> software is sort of clunky. What does everybody think about this?
> >>>>
> >>>>> 4) The only improvement I can think of is that if such an
> >>>>> error occurs, then have the option (configuration setting?) to
> create a
> >>>>> <log_segment>.skip file (or something similar).
> >>>> This is a good suggestion. Have others also seen corruption be
> generally
> >>>> tied to the same segment?
> >>>>
> >>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io>
> >>> wrote:
> >>>>> For the cleaner thread specifically, I do not think respawning will
> >>> help at
> >>>>> all because we are more than likely to run into the same issue again
> >>> which
> >>>>> would end up crashing the cleaner. Retrying makes sense for transient
> >>>>> errors or when you believe some part of the system could have healed
> >>>>> itself, both of which I think are not true for the log cleaner.
> >>>>>
> >>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com>
> >>> wrote:
> >>>>>> <<<respawning threads is likely to make things worse,
by putting you
> >>> in
> >>>>> an
> >>>>>> infinite loop which consumes resources and fires off continuous
log
> >>>>>> messages.
> >>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
this
> >>> effect
> >>>>> is
> >>>>>> to implement a backoff mechanism (if a second respawn is to
occur
> then
> >>>>> wait
> >>>>>> for 1 minute before doing it; then if a third respawn is to
occur
> wait
> >>>>> for
> >>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up
to
> some
> >>> max
> >>>>>> wait time).
> >>>>>>
> >>>>>> I have no opinion on whether respawn is appropriate or not in
this
> >>>>> context,
> >>>>>> but a mitigation like the increasing backoff described above
may be
> >>>>>> relevant in weighing the pros and cons.
> >>>>>>
> >>>>>> Ron
> >>>>>>
> >>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org>
> >>> wrote:
> >>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
> >>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>
> >>>>>>>> I agree that it would be good if the LogCleaner were
more tolerant
> >>> of
> >>>>>>>> errors. Currently, as you said, once it dies, it stays
dead.
> >>>>>>>>
> >>>>>>>> Things are better now than they used to be. We have
the metric
> >>>>>>>>
>  kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>> which we can use to tell us if the threads are dead.
And as of
> >>> 1.1.0,
> >>>>>> we
> >>>>>>>> have KIP-226, which allows you to restart the log cleaner
thread,
> >>>>>>>> without requiring a broker restart.
> >>>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>> <
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>> I've only read about this, I haven't personally tried
it.
> >>>>>>> Thanks for pointing this out, James!  Stanislav, we should
probably
> >>>>> add a
> >>>>>>> sentence or two mentioning the KIP-226 changes somewhere
in the
> KIP.
> >>>>>> Maybe
> >>>>>>> in the intro section?
> >>>>>>>
> >>>>>>> I think it's clear that requiring the users to manually
restart the
> >>> log
> >>>>>>> cleaner is not a very good solution.  But it's good to know
that
> >>> it's a
> >>>>>>> possibility on some older releases.
> >>>>>>>
> >>>>>>>> Some comments:
> >>>>>>>> * I like the idea of having the log cleaner continue
to clean as
> >>> many
> >>>>>>>> partitions as it can, skipping over the problematic
ones if
> >>> possible.
> >>>>>>>> * If the log cleaner thread dies, I think it should
automatically
> be
> >>>>>>>> revived. Your KIP attempts to do that by catching exceptions
> during
> >>>>>>>> execution, but I think we should go all the way and
make sure
> that a
> >>>>>> new
> >>>>>>>> one gets created, if the thread ever dies.
> >>>>>>> This is inconsistent with the way the rest of Kafka works.
 We
> don't
> >>>>>>> automatically re-create other threads in the broker if they
> >>> terminate.
> >>>>>> In
> >>>>>>> general, if there is a serious bug in the code, respawning
threads
> is
> >>>>>>> likely to make things worse, by putting you in an infinite
loop
> which
> >>>>>>> consumes resources and fires off continuous log messages.
> >>>>>>>
> >>>>>>>> * It might be worth trying to re-clean the uncleanable
partitions.
> >>>>> I've
> >>>>>>>> seen cases where an uncleanable partition later became
cleanable.
> I
> >>>>>>>> unfortunately don't remember how that happened, but
I remember
> being
> >>>>>>>> surprised when I discovered it. It might have been something
like
> a
> >>>>>>>> follower was uncleanable but after a leader election
happened, the
> >>>>> log
> >>>>>>>> truncated and it was then cleanable again. I'm not sure.
> >>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed
> >>> File
> >>>>>>> System (HDFS) and it was a constant source of user problems.
> >>>>>>>
> >>>>>>> What would happen is disks would just go bad over time.
 The
> DataNode
> >>>>>>> would notice this and take them offline.  But then, due
to some
> >>>>>>> "optimistic" code, the DataNode would periodically try to
re-add
> them
> >>>>> to
> >>>>>>> the system.  Then one of two things would happen: the disk
would
> just
> >>>>>> fail
> >>>>>>> immediately again, or it would appear to work and then fail
after a
> >>>>> short
> >>>>>>> amount of time.
> >>>>>>>
> >>>>>>> The way the disk failed was normally having an I/O request
take a
> >>>>> really
> >>>>>>> long time and time out.  So a bunch of request handler threads
> would
> >>>>>>> basically slam into a brick wall when they tried to access
the bad
> >>>>> disk,
> >>>>>>> slowing the DataNode to a crawl.  It was even worse in the
second
> >>>>>> scenario,
> >>>>>>> if the disk appeared to work for a while, but then failed.
 Any
> data
> >>>>> that
> >>>>>>> had been written on that DataNode to that disk would be
lost, and
> we
> >>>>>> would
> >>>>>>> need to re-replicate it.
> >>>>>>>
> >>>>>>> Disks aren't biological systems-- they don't heal over time.
 Once
> >>>>>> they're
> >>>>>>> bad, they stay bad.  The log cleaner needs to be robust
against
> cases
> >>>>>> where
> >>>>>>> the disk really is failing, and really is returning bad
data or
> >>> timing
> >>>>>> out.
> >>>>>>>> * For your metrics, can you spell out the full metric
in JMX-style
> >>>>>>>> format, such as:
> >>>>>>>>
> >>>>>>
> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>>>>                 value=4
> >>>>>>>>
> >>>>>>>> * For "uncleanable-partitions": topic-partition names
can be very
> >>>>> long.
> >>>>>>>> I think the current max size is 210 characters (or maybe
> 240-ish?).
> >>>>>>>> Having the "uncleanable-partitions" being a list could
be very
> large
> >>>>>>>> metric. Also, having the metric come out as a csv might
be
> difficult
> >>>>> to
> >>>>>>>> work with for monitoring systems. If we *did* want the
topic names
> >>> to
> >>>>>> be
> >>>>>>>> accessible, what do you think of having the
> >>>>>>>>         kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>> I'm not sure if LogCleanerManager is the right type,
but my
> example
> >>>>> was
> >>>>>>>> that the topic and partition can be tags in the metric.
That will
> >>>>> allow
> >>>>>>>> monitoring systems to more easily slice and dice the
metric. I'm
> not
> >>>>>>>> sure what the attribute for that metric would be. Maybe
something
> >>>>> like
> >>>>>>>> "uncleaned bytes" for that topic-partition? Or
> >>> time-since-last-clean?
> >>>>>> Or
> >>>>>>>> maybe even just "Value=1".
> >>>>>>> I haven't though about this that hard, but do we really
need the
> >>>>>>> uncleanable topic names to be accessible through a metric?
 It
> seems
> >>>>> like
> >>>>>>> the admin should notice that uncleanable partitions are
present,
> and
> >>>>> then
> >>>>>>> check the logs?
> >>>>>>>
> >>>>>>>> * About `max.uncleanable.partitions`, you said that
this likely
> >>>>>>>> indicates that the disk is having problems. I'm not
sure that is
> the
> >>>>>>>> case. For the 4 JIRAs that you mentioned about log cleaner
> problems,
> >>>>>> all
> >>>>>>>> of them are partition-level scenarios that happened
during normal
> >>>>>>>> operation. None of them were indicative of disk problems.
> >>>>>>> I don't think this is a meaningful comparison.  In general,
we
> don't
> >>>>>>> accept JIRAs for hard disk problems that happen on a particular
> >>>>> cluster.
> >>>>>>> If someone opened a JIRA that said "my hard disk is having
> problems"
> >>> we
> >>>>>>> could close that as "not a Kafka bug."  This doesn't prove
that
> disk
> >>>>>>> problems don't happen, but  just that JIRA isn't the right
place
> for
> >>>>>> them.
> >>>>>>> I do agree that the log cleaner has had a significant number
of
> logic
> >>>>>>> bugs, and that we need to be careful to limit their impact.
 That's
> >>> one
> >>>>>>> reason why I think that a threshold of "number of uncleanable
logs"
> >>> is
> >>>>> a
> >>>>>>> good idea, rather than just failing after one IOException.
 In all
> >>> the
> >>>>>>> cases I've seen where a user hit a logic bug in the log
cleaner, it
> >>> was
> >>>>>>> just one partition that had the issue.  We also should increase
> test
> >>>>>>> coverage for the log cleaner.
> >>>>>>>
> >>>>>>>> * About marking disks as offline when exceeding a certain
> threshold,
> >>>>>>>> that actually increases the blast radius of log compaction
> failures.
> >>>>>>>> Currently, the uncleaned partitions are still readable
and
> writable.
> >>>>>>>> Taking the disks offline would impact availability of
the
> >>> uncleanable
> >>>>>>>> partitions, as well as impact all other partitions that
are on the
> >>>>>> disk.
> >>>>>>> In general, when we encounter I/O errors, we take the disk
> partition
> >>>>>>> offline.  This is spelled out in KIP-112 (
> >>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>>>>> ) :
> >>>>>>>
> >>>>>>>> - Broker assumes a log directory to be good after it
starts, and
> >>> mark
> >>>>>>> log directory as
> >>>>>>>> bad once there is IOException when broker attempts to
access (i.e.
> >>>>> read
> >>>>>>> or write) the log directory.
> >>>>>>>> - Broker will be offline if all log directories are
bad.
> >>>>>>>> - Broker will stop serving replicas in any bad log directory.
New
> >>>>>>> replicas will only be created
> >>>>>>>> on good log directory.
> >>>>>>> The behavior Stanislav is proposing for the log cleaner
is actually
> >>>>> more
> >>>>>>> optimistic than what we do for regular broker I/O, since
we will
> >>>>> tolerate
> >>>>>>> multiple IOExceptions, not just one.  But it's generally
> consistent.
> >>>>>>> Ignoring errors is not.  In any case, if you want to tolerate
an
> >>>>>> unlimited
> >>>>>>> number of I/O errors, you can just set the threshold to
an infinite
> >>>>> value
> >>>>>>> (although I think that would be a bad idea).
> >>>>>>>
> >>>>>>> best,
> >>>>>>> Colin
> >>>>>>>
> >>>>>>>> -James
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski
<
> >>>>>>> stanislav@confluent.io> wrote:
> >>>>>>>>> I renamed the KIP and that changed the link. Sorry
about that.
> Here
> >>>>>> is
> >>>>>>> the
> >>>>>>>>> new link:
> >>>>>>>>>
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski
<
> >>>>>>> stanislav@confluent.io>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hey group,
> >>>>>>>>>>
> >>>>>>>>>> I created a new KIP about making log compaction
more
> >>>>> fault-tolerant.
> >>>>>>>>>> Please give it a look here and please share
what you think,
> >>>>>>> especially in
> >>>>>>>>>> regards to the points in the "Needs Discussion"
paragraph.
> >>>>>>>>>>
> >>>>>>>>>> KIP: KIP-346
> >>>>>>>>>> <
> >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> >>>>>>>>>> --
> >>>>>>>>>> Best,
> >>>>>>>>>> Stanislav
> >>>>>>>>>>
> >>>>>>>>> --
> >>>>>>>>> Best,
> >>>>>>>>> Stanislav
> >>>
> >> --
> >> Best,
> >> Stanislav
> >>
> >
>
>

-- 
Best,
Stanislav

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message