kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stanislav Kozlovski <stanis...@confluent.io>
Subject Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure
Date Wed, 01 Aug 2018 15:33:41 GMT
Hey group,

I just wanted to note that I have an implementation ready for review. Feel
free to take a quick look and raise any concerns you might have in due
time. I plan on starting the voting thread tomorrow.

Best,
Stanislav

On Wed, Aug 1, 2018 at 10:01 AM Stanislav Kozlovski <stanislav@confluent.io>
wrote:

> Hey Ray,
>
> * Yes, we'd have the logDir as a tag in the metric
> * The intention is to have Int.MaxValue as the maximum uncleanable
> partitions count
> * My idea is to store the marked logs (actually partitions) in memory
> instead of the ".skip" files to keep the change simple. I have also decided
> to omit any retries from the implementation - once a partition is marked as
> "uncleanable" it stays so until a broker restart
>
> Please do let me know if you are okay with this description. I should have
> the code available for review soon
>
> Thanks,
> Stanislav
>
> On Tue, Jul 31, 2018 at 6:30 PM Ray Chiang <rchiang@apache.org> wrote:
>
>> I had one question that I was trying to do some investigation before I
>> asked, but I'm having some issues with my JMX browser right now.
>>
>>   * For the uncleanable-partitions-count metric, is that going to be
>>     per-logDir entry?
>>   * For max.uncleanable.partitions, is the intention to have -1 be
>>     "infinite" or are we going to use Int.MaxValue as a practical
>>     equivalent?
>>   * In this sentence: "When evaluating which logs to compact, skip the
>>     marked ones.", should we define what "marking" will be?  If we're
>>     going with the ".skip" file or equivalent, can we also add how
>>     successful retries will behave?
>>
>> -Ray
>>
>> On 7/31/18 9:56 AM, Stanislav Kozlovski wrote:
>> > Hey group,
>> >
>> > I am planning on starting a voting thread tomorrow. Please do reply if
>> you
>> > feel there is anything left to discuss.
>> >
>> > Best,
>> > Stanislav
>> >
>> > On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
>> stanislav@confluent.io>
>> > wrote:
>> >
>> >> Hey, Ray
>> >>
>> >> Thanks for pointing that out, it's fixed now
>> >>
>> >> Best,
>> >> Stanislav
>> >>
>> >> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:
>> >>
>> >>> Thanks.  Can you fix the link in the "KIPs under discussion" table on
>> >>> the main KIP landing page
>> >>> <
>> >>>
>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
>> >?
>> >>>
>> >>> I tried, but the Wiki won't let me.
>> >>>
>> >>> -Ray
>> >>>
>> >>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>> >>>> Hey guys,
>> >>>>
>> >>>> @Colin - good point. I added some sentences mentioning recent
>> >>> improvements
>> >>>> in the introductory section.
>> >>>>
>> >>>> *Disk Failure* - I tend to agree with what Colin said - once a disk
>> >>> fails,
>> >>>> you don't want to work with it again. As such, I've changed my mind
>> and
>> >>>> believe that we should mark the LogDir (assume its a disk) as
>> offline on
>> >>>> the first `IOException` encountered. This is the LogCleaner's current
>> >>>> behavior. We shouldn't change that.
>> >>>>
>> >>>> *Respawning Threads* - I believe we should never re-spawn a thread.
>> The
>> >>>> correct approach in my mind is to either have it stay dead or never
>> let
>> >>> it
>> >>>> die in the first place.
>> >>>>
>> >>>> *Uncleanable-partition-names metric* - Colin is right, this metric
is
>> >>>> unneeded. Users can monitor the `uncleanable-partitions-count` metric
>> >>> and
>> >>>> inspect logs.
>> >>>>
>> >>>>
>> >>>> Hey Ray,
>> >>>>
>> >>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner
>> to
>> >>>>> skip over problematic partitions instead of dying.
>> >>>> I think we can do this for every exception that isn't `IOException`.
>> >>> This
>> >>>> will future-proof us against bugs in the system and potential other
>> >>> errors.
>> >>>> Protecting yourself against unexpected failures is always a good
>> thing
>> >>> in
>> >>>> my mind, but I also think that protecting yourself against bugs
in
>> the
>> >>>> software is sort of clunky. What does everybody think about this?
>> >>>>
>> >>>>> 4) The only improvement I can think of is that if such an
>> >>>>> error occurs, then have the option (configuration setting?)
to
>> create a
>> >>>>> <log_segment>.skip file (or something similar).
>> >>>> This is a good suggestion. Have others also seen corruption be
>> generally
>> >>>> tied to the same segment?
>> >>>>
>> >>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io>
>> >>> wrote:
>> >>>>> For the cleaner thread specifically, I do not think respawning
will
>> >>> help at
>> >>>>> all because we are more than likely to run into the same issue
again
>> >>> which
>> >>>>> would end up crashing the cleaner. Retrying makes sense for
>> transient
>> >>>>> errors or when you believe some part of the system could have
healed
>> >>>>> itself, both of which I think are not true for the log cleaner.
>> >>>>>
>> >>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com>
>> >>> wrote:
>> >>>>>> <<<respawning threads is likely to make things
worse, by putting
>> you
>> >>> in
>> >>>>> an
>> >>>>>> infinite loop which consumes resources and fires off continuous
log
>> >>>>>> messages.
>> >>>>>> Hi Colin.  In case it could be relevant, one way to mitigate
this
>> >>> effect
>> >>>>> is
>> >>>>>> to implement a backoff mechanism (if a second respawn is
to occur
>> then
>> >>>>> wait
>> >>>>>> for 1 minute before doing it; then if a third respawn is
to occur
>> wait
>> >>>>> for
>> >>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes, etc.
up to
>> some
>> >>> max
>> >>>>>> wait time).
>> >>>>>>
>> >>>>>> I have no opinion on whether respawn is appropriate or not
in this
>> >>>>> context,
>> >>>>>> but a mitigation like the increasing backoff described above
may be
>> >>>>>> relevant in weighing the pros and cons.
>> >>>>>>
>> >>>>>> Ron
>> >>>>>>
>> >>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org>
>> >>> wrote:
>> >>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>> >>>>>>>> Hi Stanislav! Thanks for this KIP!
>> >>>>>>>>
>> >>>>>>>> I agree that it would be good if the LogCleaner
were more
>> tolerant
>> >>> of
>> >>>>>>>> errors. Currently, as you said, once it dies, it
stays dead.
>> >>>>>>>>
>> >>>>>>>> Things are better now than they used to be. We have
the metric
>> >>>>>>>>
>>  kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>> >>>>>>>> which we can use to tell us if the threads are dead.
And as of
>> >>> 1.1.0,
>> >>>>>> we
>> >>>>>>>> have KIP-226, which allows you to restart the log
cleaner thread,
>> >>>>>>>> without requiring a broker restart.
>> >>>>>>>>
>> >>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>> >>>>>>>> <
>> >>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>> >>>>>>>> I've only read about this, I haven't personally
tried it.
>> >>>>>>> Thanks for pointing this out, James!  Stanislav, we
should
>> probably
>> >>>>> add a
>> >>>>>>> sentence or two mentioning the KIP-226 changes somewhere
in the
>> KIP.
>> >>>>>> Maybe
>> >>>>>>> in the intro section?
>> >>>>>>>
>> >>>>>>> I think it's clear that requiring the users to manually
restart
>> the
>> >>> log
>> >>>>>>> cleaner is not a very good solution.  But it's good
to know that
>> >>> it's a
>> >>>>>>> possibility on some older releases.
>> >>>>>>>
>> >>>>>>>> Some comments:
>> >>>>>>>> * I like the idea of having the log cleaner continue
to clean as
>> >>> many
>> >>>>>>>> partitions as it can, skipping over the problematic
ones if
>> >>> possible.
>> >>>>>>>> * If the log cleaner thread dies, I think it should
>> automatically be
>> >>>>>>>> revived. Your KIP attempts to do that by catching
exceptions
>> during
>> >>>>>>>> execution, but I think we should go all the way
and make sure
>> that a
>> >>>>>> new
>> >>>>>>>> one gets created, if the thread ever dies.
>> >>>>>>> This is inconsistent with the way the rest of Kafka
works.  We
>> don't
>> >>>>>>> automatically re-create other threads in the broker
if they
>> >>> terminate.
>> >>>>>> In
>> >>>>>>> general, if there is a serious bug in the code, respawning
>> threads is
>> >>>>>>> likely to make things worse, by putting you in an infinite
loop
>> which
>> >>>>>>> consumes resources and fires off continuous log messages.
>> >>>>>>>
>> >>>>>>>> * It might be worth trying to re-clean the uncleanable
>> partitions.
>> >>>>> I've
>> >>>>>>>> seen cases where an uncleanable partition later
became
>> cleanable. I
>> >>>>>>>> unfortunately don't remember how that happened,
but I remember
>> being
>> >>>>>>>> surprised when I discovered it. It might have been
something
>> like a
>> >>>>>>>> follower was uncleanable but after a leader election
happened,
>> the
>> >>>>> log
>> >>>>>>>> truncated and it was then cleanable again. I'm not
sure.
>> >>>>>>> James, I disagree.  We had this behavior in the Hadoop
Distributed
>> >>> File
>> >>>>>>> System (HDFS) and it was a constant source of user problems.
>> >>>>>>>
>> >>>>>>> What would happen is disks would just go bad over time.
 The
>> DataNode
>> >>>>>>> would notice this and take them offline.  But then,
due to some
>> >>>>>>> "optimistic" code, the DataNode would periodically try
to re-add
>> them
>> >>>>> to
>> >>>>>>> the system.  Then one of two things would happen: the
disk would
>> just
>> >>>>>> fail
>> >>>>>>> immediately again, or it would appear to work and then
fail after
>> a
>> >>>>> short
>> >>>>>>> amount of time.
>> >>>>>>>
>> >>>>>>> The way the disk failed was normally having an I/O request
take a
>> >>>>> really
>> >>>>>>> long time and time out.  So a bunch of request handler
threads
>> would
>> >>>>>>> basically slam into a brick wall when they tried to
access the bad
>> >>>>> disk,
>> >>>>>>> slowing the DataNode to a crawl.  It was even worse
in the second
>> >>>>>> scenario,
>> >>>>>>> if the disk appeared to work for a while, but then failed.
 Any
>> data
>> >>>>> that
>> >>>>>>> had been written on that DataNode to that disk would
be lost, and
>> we
>> >>>>>> would
>> >>>>>>> need to re-replicate it.
>> >>>>>>>
>> >>>>>>> Disks aren't biological systems-- they don't heal over
time.  Once
>> >>>>>> they're
>> >>>>>>> bad, they stay bad.  The log cleaner needs to be robust
against
>> cases
>> >>>>>> where
>> >>>>>>> the disk really is failing, and really is returning
bad data or
>> >>> timing
>> >>>>>> out.
>> >>>>>>>> * For your metrics, can you spell out the full metric
in
>> JMX-style
>> >>>>>>>> format, such as:
>> >>>>>>>>
>> >>>>>>
>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>> >>>>>>>>                 value=4
>> >>>>>>>>
>> >>>>>>>> * For "uncleanable-partitions": topic-partition
names can be very
>> >>>>> long.
>> >>>>>>>> I think the current max size is 210 characters (or
maybe
>> 240-ish?).
>> >>>>>>>> Having the "uncleanable-partitions" being a list
could be very
>> large
>> >>>>>>>> metric. Also, having the metric come out as a csv
might be
>> difficult
>> >>>>> to
>> >>>>>>>> work with for monitoring systems. If we *did* want
the topic
>> names
>> >>> to
>> >>>>>> be
>> >>>>>>>> accessible, what do you think of having the
>> >>>>>>>>         kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>> >>>>>>>> I'm not sure if LogCleanerManager is the right type,
but my
>> example
>> >>>>> was
>> >>>>>>>> that the topic and partition can be tags in the
metric. That will
>> >>>>> allow
>> >>>>>>>> monitoring systems to more easily slice and dice
the metric. I'm
>> not
>> >>>>>>>> sure what the attribute for that metric would be.
Maybe something
>> >>>>> like
>> >>>>>>>> "uncleaned bytes" for that topic-partition? Or
>> >>> time-since-last-clean?
>> >>>>>> Or
>> >>>>>>>> maybe even just "Value=1".
>> >>>>>>> I haven't though about this that hard, but do we really
need the
>> >>>>>>> uncleanable topic names to be accessible through a metric?
 It
>> seems
>> >>>>> like
>> >>>>>>> the admin should notice that uncleanable partitions
are present,
>> and
>> >>>>> then
>> >>>>>>> check the logs?
>> >>>>>>>
>> >>>>>>>> * About `max.uncleanable.partitions`, you said that
this likely
>> >>>>>>>> indicates that the disk is having problems. I'm
not sure that is
>> the
>> >>>>>>>> case. For the 4 JIRAs that you mentioned about log
cleaner
>> problems,
>> >>>>>> all
>> >>>>>>>> of them are partition-level scenarios that happened
during normal
>> >>>>>>>> operation. None of them were indicative of disk
problems.
>> >>>>>>> I don't think this is a meaningful comparison.  In general,
we
>> don't
>> >>>>>>> accept JIRAs for hard disk problems that happen on a
particular
>> >>>>> cluster.
>> >>>>>>> If someone opened a JIRA that said "my hard disk is
having
>> problems"
>> >>> we
>> >>>>>>> could close that as "not a Kafka bug."  This doesn't
prove that
>> disk
>> >>>>>>> problems don't happen, but  just that JIRA isn't the
right place
>> for
>> >>>>>> them.
>> >>>>>>> I do agree that the log cleaner has had a significant
number of
>> logic
>> >>>>>>> bugs, and that we need to be careful to limit their
impact.
>> That's
>> >>> one
>> >>>>>>> reason why I think that a threshold of "number of uncleanable
>> logs"
>> >>> is
>> >>>>> a
>> >>>>>>> good idea, rather than just failing after one IOException.
 In all
>> >>> the
>> >>>>>>> cases I've seen where a user hit a logic bug in the
log cleaner,
>> it
>> >>> was
>> >>>>>>> just one partition that had the issue.  We also should
increase
>> test
>> >>>>>>> coverage for the log cleaner.
>> >>>>>>>
>> >>>>>>>> * About marking disks as offline when exceeding
a certain
>> threshold,
>> >>>>>>>> that actually increases the blast radius of log
compaction
>> failures.
>> >>>>>>>> Currently, the uncleaned partitions are still readable
and
>> writable.
>> >>>>>>>> Taking the disks offline would impact availability
of the
>> >>> uncleanable
>> >>>>>>>> partitions, as well as impact all other partitions
that are on
>> the
>> >>>>>> disk.
>> >>>>>>> In general, when we encounter I/O errors, we take the
disk
>> partition
>> >>>>>>> offline.  This is spelled out in KIP-112 (
>> >>>>>>>
>> >>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
>> >>>>>>> ) :
>> >>>>>>>
>> >>>>>>>> - Broker assumes a log directory to be good after
it starts, and
>> >>> mark
>> >>>>>>> log directory as
>> >>>>>>>> bad once there is IOException when broker attempts
to access
>> (i.e.
>> >>>>> read
>> >>>>>>> or write) the log directory.
>> >>>>>>>> - Broker will be offline if all log directories
are bad.
>> >>>>>>>> - Broker will stop serving replicas in any bad log
directory. New
>> >>>>>>> replicas will only be created
>> >>>>>>>> on good log directory.
>> >>>>>>> The behavior Stanislav is proposing for the log cleaner
is
>> actually
>> >>>>> more
>> >>>>>>> optimistic than what we do for regular broker I/O, since
we will
>> >>>>> tolerate
>> >>>>>>> multiple IOExceptions, not just one.  But it's generally
>> consistent.
>> >>>>>>> Ignoring errors is not.  In any case, if you want to
tolerate an
>> >>>>>> unlimited
>> >>>>>>> number of I/O errors, you can just set the threshold
to an
>> infinite
>> >>>>> value
>> >>>>>>> (although I think that would be a bad idea).
>> >>>>>>>
>> >>>>>>> best,
>> >>>>>>> Colin
>> >>>>>>>
>> >>>>>>>> -James
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski
<
>> >>>>>>> stanislav@confluent.io> wrote:
>> >>>>>>>>> I renamed the KIP and that changed the link.
Sorry about that.
>> Here
>> >>>>>> is
>> >>>>>>> the
>> >>>>>>>>> new link:
>> >>>>>>>>>
>> >>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>> >>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski
<
>> >>>>>>> stanislav@confluent.io>
>> >>>>>>>>> wrote:
>> >>>>>>>>>
>> >>>>>>>>>> Hey group,
>> >>>>>>>>>>
>> >>>>>>>>>> I created a new KIP about making log compaction
more
>> >>>>> fault-tolerant.
>> >>>>>>>>>> Please give it a look here and please share
what you think,
>> >>>>>>> especially in
>> >>>>>>>>>> regards to the points in the "Needs Discussion"
paragraph.
>> >>>>>>>>>>
>> >>>>>>>>>> KIP: KIP-346
>> >>>>>>>>>> <
>> >>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
>> >>>>>>>>>> --
>> >>>>>>>>>> Best,
>> >>>>>>>>>> Stanislav
>> >>>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Best,
>> >>>>>>>>> Stanislav
>> >>>
>> >> --
>> >> Best,
>> >> Stanislav
>> >>
>> >
>>
>>
>
> --
> Best,
> Stanislav
>


-- 
Best,
Stanislav

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message