kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Colin McCabe <cmcc...@apache.org>
Subject Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure
Date Thu, 02 Aug 2018 16:55:53 GMT
On Wed, Aug 1, 2018, at 11:35, James Cheng wrote:
> I’m a little confused about something. Is this KIP focused on log 
> cleaner exceptions in general, or focused on log cleaner exceptions due 
> to disk failures?
> 
> Will max.uncleanable.partitions apply to all exceptions (including log 
> cleaner logic errors) or will it apply to only disk I/o exceptions?

There is no difference between "log cleaner exceptions in general" and "log cleaner exceptions
due to disk failures."

For example, if the data on disk is corrupted we might read a 4-byte size as -1 instead of
100.  Then we would get a BufferUnderFlowException later on.  This is a subclass of RuntimeException
rather than IOException, of course, but it does result from a disk problem.  Or we might get
exceptions while validating checksums, which may or may not be IOE (I haven't looked).

Of course, the log cleaner itself may have a bug, which results in it throwing an exception
even if the disk does not have a problem.  We clearly want to fix these bugs.  But there's
no way for the program itself to know that it has a bug and act differently.  If an exception
occurs, we must assume there is a disk problem.

> 
> I can understand taking the disk offline if there have been “N” I/O 
> exceptions. Disk errors are user fixable (by replacing the affected 
> disk). It turns an invisible (soft?) failure into a visible hard 
> failure. And the I/O exceptions are possibly already causing problems, 
> so it makes sense to limit their impact.
> 
> But I’m not sure if it makes sense to take a disk offline after”N” logic 
> errors in the log cleaner. If a log cleaner logic error happens, it’s 
> rarely user fixable. And it will likely several partitions at once, so 
> you’re likely to bump up against the max.uncleanable.partitions limit 
> more quickly. If a disk was taken due to logic errors, I’m not sure what 
> the user would do.

I don't agree that log cleaner bugs "will likely [affect] several partitions at once."  Most
of the ones I've looked at only affect one or two partitions.  In particular the ones that
resulted from over-eagerness to use 32-bit math on 64-bit values.

If the log cleaner is so buggy that it's useless (the scenario you're describing), and you
want to put off an upgrade, then you can set max.uncleanable.partitions to the maximum value
to ignore failures.

best,
Colin


> 
> -James
> 
> Sent from my iPhone
> 
> > On Aug 1, 2018, at 9:11 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
> > 
> > Yes, good catch. Thank you, James!
> > 
> > Best,
> > Stanislav
> > 
> >> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushujames@gmail.com> wrote:
> >> 
> >> Can you update the KIP to say what the default is for
> >> max.uncleanable.partitions?
> >> 
> >> -James
> >> 
> >> Sent from my iPhone
> >> 
> >>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io>
> >> wrote:
> >>> 
> >>> Hey group,
> >>> 
> >>> I am planning on starting a voting thread tomorrow. Please do reply if
> >> you
> >>> feel there is anything left to discuss.
> >>> 
> >>> Best,
> >>> Stanislav
> >>> 
> >>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <
> >> stanislav@confluent.io>
> >>> wrote:
> >>> 
> >>>> Hey, Ray
> >>>> 
> >>>> Thanks for pointing that out, it's fixed now
> >>>> 
> >>>> Best,
> >>>> Stanislav
> >>>> 
> >>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org>
wrote:
> >>>>> 
> >>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table
on
> >>>>> the main KIP landing page
> >>>>> <
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#
> >>> ?
> >>>>> 
> >>>>> I tried, but the Wiki won't let me.
> >>>>> 
> >>>>> -Ray
> >>>>> 
> >>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
> >>>>>> Hey guys,
> >>>>>> 
> >>>>>> @Colin - good point. I added some sentences mentioning recent
> >>>>> improvements
> >>>>>> in the introductory section.
> >>>>>> 
> >>>>>> *Disk Failure* - I tend to agree with what Colin said - once
a disk
> >>>>> fails,
> >>>>>> you don't want to work with it again. As such, I've changed
my mind
> >> and
> >>>>>> believe that we should mark the LogDir (assume its a disk) as
offline
> >> on
> >>>>>> the first `IOException` encountered. This is the LogCleaner's
current
> >>>>>> behavior. We shouldn't change that.
> >>>>>> 
> >>>>>> *Respawning Threads* - I believe we should never re-spawn a
thread.
> >> The
> >>>>>> correct approach in my mind is to either have it stay dead or
never
> >> let
> >>>>> it
> >>>>>> die in the first place.
> >>>>>> 
> >>>>>> *Uncleanable-partition-names metric* - Colin is right, this
metric is
> >>>>>> unneeded. Users can monitor the `uncleanable-partitions-count`
metric
> >>>>> and
> >>>>>> inspect logs.
> >>>>>> 
> >>>>>> 
> >>>>>> Hey Ray,
> >>>>>> 
> >>>>>>> 2) I'm 100% with James in agreement with setting up the
LogCleaner to
> >>>>>>> skip over problematic partitions instead of dying.
> >>>>>> I think we can do this for every exception that isn't `IOException`.
> >>>>> This
> >>>>>> will future-proof us against bugs in the system and potential
other
> >>>>> errors.
> >>>>>> Protecting yourself against unexpected failures is always a
good thing
> >>>>> in
> >>>>>> my mind, but I also think that protecting yourself against bugs
in the
> >>>>>> software is sort of clunky. What does everybody think about
this?
> >>>>>> 
> >>>>>>> 4) The only improvement I can think of is that if such an
> >>>>>>> error occurs, then have the option (configuration setting?)
to
> >> create a
> >>>>>>> <log_segment>.skip file (or something similar).
> >>>>>> This is a good suggestion. Have others also seen corruption
be
> >> generally
> >>>>>> tied to the same segment?
> >>>>>> 
> >>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io>
> >>>>> wrote:
> >>>>>> 
> >>>>>>> For the cleaner thread specifically, I do not think respawning
will
> >>>>> help at
> >>>>>>> all because we are more than likely to run into the same
issue again
> >>>>> which
> >>>>>>> would end up crashing the cleaner. Retrying makes sense
for transient
> >>>>>>> errors or when you believe some part of the system could
have healed
> >>>>>>> itself, both of which I think are not true for the log cleaner.
> >>>>>>> 
> >>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com>
> >>>>> wrote:
> >>>>>>> 
> >>>>>>>> <<<respawning threads is likely to make things
worse, by putting you
> >>>>> in
> >>>>>>> an
> >>>>>>>> infinite loop which consumes resources and fires off
continuous log
> >>>>>>>> messages.
> >>>>>>>> Hi Colin.  In case it could be relevant, one way to
mitigate this
> >>>>> effect
> >>>>>>> is
> >>>>>>>> to implement a backoff mechanism (if a second respawn
is to occur
> >> then
> >>>>>>> wait
> >>>>>>>> for 1 minute before doing it; then if a third respawn
is to occur
> >> wait
> >>>>>>> for
> >>>>>>>> 2 minutes before doing it; then 4 minutes, 8 minutes,
etc. up to
> >> some
> >>>>> max
> >>>>>>>> wait time).
> >>>>>>>> 
> >>>>>>>> I have no opinion on whether respawn is appropriate
or not in this
> >>>>>>> context,
> >>>>>>>> but a mitigation like the increasing backoff described
above may be
> >>>>>>>> relevant in weighing the pros and cons.
> >>>>>>>> 
> >>>>>>>> Ron
> >>>>>>>> 
> >>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org>
> >>>>> wrote:
> >>>>>>>> 
> >>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng
wrote:
> >>>>>>>>>> Hi Stanislav! Thanks for this KIP!
> >>>>>>>>>> 
> >>>>>>>>>> I agree that it would be good if the LogCleaner
were more tolerant
> >>>>> of
> >>>>>>>>>> errors. Currently, as you said, once it dies,
it stays dead.
> >>>>>>>>>> 
> >>>>>>>>>> Things are better now than they used to be.
We have the metric
> >>>>>>>>>>      kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
> >>>>>>>>>> which we can use to tell us if the threads are
dead. And as of
> >>>>> 1.1.0,
> >>>>>>>> we
> >>>>>>>>>> have KIP-226, which allows you to restart the
log cleaner thread,
> >>>>>>>>>> without requiring a broker restart.
> >>>>>>>>>> 
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>>>> <
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
> >>>>>>>>> 
> >>>>>>>>>> I've only read about this, I haven't personally
tried it.
> >>>>>>>>> Thanks for pointing this out, James!  Stanislav,
we should probably
> >>>>>>> add a
> >>>>>>>>> sentence or two mentioning the KIP-226 changes somewhere
in the
> >> KIP.
> >>>>>>>> Maybe
> >>>>>>>>> in the intro section?
> >>>>>>>>> 
> >>>>>>>>> I think it's clear that requiring the users to manually
restart the
> >>>>> log
> >>>>>>>>> cleaner is not a very good solution.  But it's good
to know that
> >>>>> it's a
> >>>>>>>>> possibility on some older releases.
> >>>>>>>>> 
> >>>>>>>>>> Some comments:
> >>>>>>>>>> * I like the idea of having the log cleaner
continue to clean as
> >>>>> many
> >>>>>>>>>> partitions as it can, skipping over the problematic
ones if
> >>>>> possible.
> >>>>>>>>>> 
> >>>>>>>>>> * If the log cleaner thread dies, I think it
should automatically
> >> be
> >>>>>>>>>> revived. Your KIP attempts to do that by catching
exceptions
> >> during
> >>>>>>>>>> execution, but I think we should go all the
way and make sure
> >> that a
> >>>>>>>> new
> >>>>>>>>>> one gets created, if the thread ever dies.
> >>>>>>>>> This is inconsistent with the way the rest of Kafka
works.  We
> >> don't
> >>>>>>>>> automatically re-create other threads in the broker
if they
> >>>>> terminate.
> >>>>>>>> In
> >>>>>>>>> general, if there is a serious bug in the code,
respawning threads
> >> is
> >>>>>>>>> likely to make things worse, by putting you in an
infinite loop
> >> which
> >>>>>>>>> consumes resources and fires off continuous log
messages.
> >>>>>>>>> 
> >>>>>>>>>> * It might be worth trying to re-clean the uncleanable
partitions.
> >>>>>>> I've
> >>>>>>>>>> seen cases where an uncleanable partition later
became cleanable.
> >> I
> >>>>>>>>>> unfortunately don't remember how that happened,
but I remember
> >> being
> >>>>>>>>>> surprised when I discovered it. It might have
been something like
> >> a
> >>>>>>>>>> follower was uncleanable but after a leader
election happened, the
> >>>>>>> log
> >>>>>>>>>> truncated and it was then cleanable again. I'm
not sure.
> >>>>>>>>> James, I disagree.  We had this behavior in the
Hadoop Distributed
> >>>>> File
> >>>>>>>>> System (HDFS) and it was a constant source of user
problems.
> >>>>>>>>> 
> >>>>>>>>> What would happen is disks would just go bad over
time.  The
> >> DataNode
> >>>>>>>>> would notice this and take them offline.  But then,
due to some
> >>>>>>>>> "optimistic" code, the DataNode would periodically
try to re-add
> >> them
> >>>>>>> to
> >>>>>>>>> the system.  Then one of two things would happen:
the disk would
> >> just
> >>>>>>>> fail
> >>>>>>>>> immediately again, or it would appear to work and
then fail after a
> >>>>>>> short
> >>>>>>>>> amount of time.
> >>>>>>>>> 
> >>>>>>>>> The way the disk failed was normally having an I/O
request take a
> >>>>>>> really
> >>>>>>>>> long time and time out.  So a bunch of request handler
threads
> >> would
> >>>>>>>>> basically slam into a brick wall when they tried
to access the bad
> >>>>>>> disk,
> >>>>>>>>> slowing the DataNode to a crawl.  It was even worse
in the second
> >>>>>>>> scenario,
> >>>>>>>>> if the disk appeared to work for a while, but then
failed.  Any
> >> data
> >>>>>>> that
> >>>>>>>>> had been written on that DataNode to that disk would
be lost, and
> >> we
> >>>>>>>> would
> >>>>>>>>> need to re-replicate it.
> >>>>>>>>> 
> >>>>>>>>> Disks aren't biological systems-- they don't heal
over time.  Once
> >>>>>>>> they're
> >>>>>>>>> bad, they stay bad.  The log cleaner needs to be
robust against
> >> cases
> >>>>>>>> where
> >>>>>>>>> the disk really is failing, and really is returning
bad data or
> >>>>> timing
> >>>>>>>> out.
> >>>>>>>>>> * For your metrics, can you spell out the full
metric in JMX-style
> >>>>>>>>>> format, such as:
> >>>>>>>>>> 
> >>>>>>>> kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
> >>>>>>>>>>              value=4
> >>>>>>>>>> 
> >>>>>>>>>> * For "uncleanable-partitions": topic-partition
names can be very
> >>>>>>> long.
> >>>>>>>>>> I think the current max size is 210 characters
(or maybe
> >> 240-ish?).
> >>>>>>>>>> Having the "uncleanable-partitions" being a
list could be very
> >> large
> >>>>>>>>>> metric. Also, having the metric come out as
a csv might be
> >> difficult
> >>>>>>> to
> >>>>>>>>>> work with for monitoring systems. If we *did*
want the topic names
> >>>>> to
> >>>>>>>> be
> >>>>>>>>>> accessible, what do you think of having the
> >>>>>>>>>>      kafka.log:type=LogCleanerManager,topic=topic1,partition=2
> >>>>>>>>>> I'm not sure if LogCleanerManager is the right
type, but my
> >> example
> >>>>>>> was
> >>>>>>>>>> that the topic and partition can be tags in
the metric. That will
> >>>>>>> allow
> >>>>>>>>>> monitoring systems to more easily slice and
dice the metric. I'm
> >> not
> >>>>>>>>>> sure what the attribute for that metric would
be. Maybe something
> >>>>>>> like
> >>>>>>>>>> "uncleaned bytes" for that topic-partition?
Or
> >>>>> time-since-last-clean?
> >>>>>>>> Or
> >>>>>>>>>> maybe even just "Value=1".
> >>>>>>>>> I haven't though about this that hard, but do we
really need the
> >>>>>>>>> uncleanable topic names to be accessible through
a metric?  It
> >> seems
> >>>>>>> like
> >>>>>>>>> the admin should notice that uncleanable partitions
are present,
> >> and
> >>>>>>> then
> >>>>>>>>> check the logs?
> >>>>>>>>> 
> >>>>>>>>>> * About `max.uncleanable.partitions`, you said
that this likely
> >>>>>>>>>> indicates that the disk is having problems.
I'm not sure that is
> >> the
> >>>>>>>>>> case. For the 4 JIRAs that you mentioned about
log cleaner
> >> problems,
> >>>>>>>> all
> >>>>>>>>>> of them are partition-level scenarios that happened
during normal
> >>>>>>>>>> operation. None of them were indicative of disk
problems.
> >>>>>>>>> I don't think this is a meaningful comparison. 
In general, we
> >> don't
> >>>>>>>>> accept JIRAs for hard disk problems that happen
on a particular
> >>>>>>> cluster.
> >>>>>>>>> If someone opened a JIRA that said "my hard disk
is having
> >> problems"
> >>>>> we
> >>>>>>>>> could close that as "not a Kafka bug."  This doesn't
prove that
> >> disk
> >>>>>>>>> problems don't happen, but  just that JIRA isn't
the right place
> >> for
> >>>>>>>> them.
> >>>>>>>>> I do agree that the log cleaner has had a significant
number of
> >> logic
> >>>>>>>>> bugs, and that we need to be careful to limit their
impact.  That's
> >>>>> one
> >>>>>>>>> reason why I think that a threshold of "number of
uncleanable logs"
> >>>>> is
> >>>>>>> a
> >>>>>>>>> good idea, rather than just failing after one IOException.
 In all
> >>>>> the
> >>>>>>>>> cases I've seen where a user hit a logic bug in
the log cleaner, it
> >>>>> was
> >>>>>>>>> just one partition that had the issue.  We also
should increase
> >> test
> >>>>>>>>> coverage for the log cleaner.
> >>>>>>>>> 
> >>>>>>>>>> * About marking disks as offline when exceeding
a certain
> >> threshold,
> >>>>>>>>>> that actually increases the blast radius of
log compaction
> >> failures.
> >>>>>>>>>> Currently, the uncleaned partitions are still
readable and
> >> writable.
> >>>>>>>>>> Taking the disks offline would impact availability
of the
> >>>>> uncleanable
> >>>>>>>>>> partitions, as well as impact all other partitions
that are on the
> >>>>>>>> disk.
> >>>>>>>>> In general, when we encounter I/O errors, we take
the disk
> >> partition
> >>>>>>>>> offline.  This is spelled out in KIP-112 (
> >>>>>>>>> 
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD
> >>>>>>>>> ) :
> >>>>>>>>> 
> >>>>>>>>>> - Broker assumes a log directory to be good
after it starts, and
> >>>>> mark
> >>>>>>>>> log directory as
> >>>>>>>>>> bad once there is IOException when broker attempts
to access (i.e.
> >>>>>>> read
> >>>>>>>>> or write) the log directory.
> >>>>>>>>>> - Broker will be offline if all log directories
are bad.
> >>>>>>>>>> - Broker will stop serving replicas in any bad
log directory. New
> >>>>>>>>> replicas will only be created
> >>>>>>>>>> on good log directory.
> >>>>>>>>> The behavior Stanislav is proposing for the log
cleaner is actually
> >>>>>>> more
> >>>>>>>>> optimistic than what we do for regular broker I/O,
since we will
> >>>>>>> tolerate
> >>>>>>>>> multiple IOExceptions, not just one.  But it's generally
> >> consistent.
> >>>>>>>>> Ignoring errors is not.  In any case, if you want
to tolerate an
> >>>>>>>> unlimited
> >>>>>>>>> number of I/O errors, you can just set the threshold
to an infinite
> >>>>>>> value
> >>>>>>>>> (although I think that would be a bad idea).
> >>>>>>>>> 
> >>>>>>>>> best,
> >>>>>>>>> Colin
> >>>>>>>>> 
> >>>>>>>>>> -James
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski
<
> >>>>>>>>> stanislav@confluent.io> wrote:
> >>>>>>>>>>> I renamed the KIP and that changed the link.
Sorry about that.
> >> Here
> >>>>>>>> is
> >>>>>>>>> the
> >>>>>>>>>>> new link:
> >>>>>>>>>>> 
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
> >>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav
Kozlovski <
> >>>>>>>>> stanislav@confluent.io>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>>> Hey group,
> >>>>>>>>>>>> 
> >>>>>>>>>>>> I created a new KIP about making log
compaction more
> >>>>>>> fault-tolerant.
> >>>>>>>>>>>> Please give it a look here and please
share what you think,
> >>>>>>>>> especially in
> >>>>>>>>>>>> regards to the points in the "Needs
Discussion" paragraph.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> KIP: KIP-346
> >>>>>>>>>>>> <
> >>>>>>> 
> >>>>> 
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure
> >>>>>>>>>>>> --
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Stanislav
> >>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> --
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Stanislav
> >>>>>> 
> >>>>> 
> >>>>> 
> >>>> 
> >>>> --
> >>>> Best,
> >>>> Stanislav
> >>>> 
> >>> 
> >>> 
> >>> --
> >>> Best,
> >>> Stanislav
> >> 
> > 
> > 
> > -- 
> > Best,
> > Stanislav

Mime
View raw message