kafka-dev mailing list archives

From Ray Chiang <rchi...@apache.org>
Subject Re: [DISCUSS] KIP-346 - Limit blast radius of log compaction failure
Date Thu, 02 Aug 2018 18:32:27 GMT
One more thing occurred to me.  Should the configuration property be 
named "max.uncleanable.partitions.per.disk" instead?

-Ray


On 8/1/18 9:11 AM, Stanislav Kozlovski wrote:
> Yes, good catch. Thank you, James!
>
> Best,
> Stanislav
>
> On Wed, Aug 1, 2018 at 5:05 PM James Cheng <wushujames@gmail.com> wrote:
>
>> Can you update the KIP to say what the default is for
>> max.uncleanable.partitions?
>>
>> -James
>>
>> Sent from my iPhone
>>
>>> On Jul 31, 2018, at 9:56 AM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
>>> Hey group,
>>>
>>> I am planning on starting a voting thread tomorrow. Please do reply if you feel there is anything left to discuss.
>>>
>>> Best,
>>> Stanislav
>>>
>>> On Fri, Jul 27, 2018 at 11:05 PM Stanislav Kozlovski <stanislav@confluent.io> wrote:
>>>
>>>> Hey, Ray
>>>>
>>>> Thanks for pointing that out, it's fixed now
>>>>
>>>> Best,
>>>> Stanislav
>>>>
>>>>> On Fri, Jul 27, 2018 at 9:43 PM Ray Chiang <rchiang@apache.org> wrote:
>>>>>
>>>>> Thanks.  Can you fix the link in the "KIPs under discussion" table on the main KIP landing page <https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals#>? I tried, but the Wiki won't let me.
>>>>>
>>>>> -Ray
>>>>>
>>>>>> On 7/26/18 2:01 PM, Stanislav Kozlovski wrote:
>>>>>> Hey guys,
>>>>>>
>>>>>> @Colin - good point. I added some sentences mentioning recent improvements in the introductory section.
>>>>>>
>>>>>> *Disk Failure* - I tend to agree with what Colin said - once a disk fails, you don't want to work with it again. As such, I've changed my mind and believe that we should mark the LogDir (assume it's a disk) as offline on the first `IOException` encountered. This is the LogCleaner's current behavior. We shouldn't change that.
>>>>>>
>>>>>> *Respawning Threads* - I believe we should never re-spawn a thread. The correct approach in my mind is to either have it stay dead or never let it die in the first place.
>>>>>>
>>>>>> *Uncleanable-partition-names metric* - Colin is right, this metric is unneeded. Users can monitor the `uncleanable-partitions-count` metric and inspect logs.
>>>>>>
>>>>>>
>>>>>> Hey Ray,
>>>>>>
>>>>>>> 2) I'm 100% with James in agreement with setting up the LogCleaner to skip over problematic partitions instead of dying.
>>>>>> I think we can do this for every exception that isn't `IOException`. This will future-proof us against bugs in the system and potential other errors. Protecting yourself against unexpected failures is always a good thing in my mind, but I also think that protecting yourself against bugs in the software is sort of clunky. What does everybody think about this?
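[As a rough illustration of the skip-and-continue policy described above: any exception that is not an `IOException` quarantines only the offending partition, while an `IOException` escalates so the log dir can be marked offline. All names here (`CleanerSweep`, `sweep`, `cleanPartition`) are made up for illustration; this is not the actual LogCleaner code.]

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;

public class CleanerSweep {
    public interface PartitionCleaner {
        void cleanPartition(String topicPartition) throws Exception;
    }

    /**
     * Clean every partition not already known to be uncleanable.
     * Returns true if an IOException was hit, i.e. the log dir should be
     * marked offline; any other exception quarantines only that partition.
     */
    public static boolean sweep(List<String> partitions,
                                PartitionCleaner cleaner,
                                Set<String> uncleanable) {
        for (String tp : partitions) {
            if (uncleanable.contains(tp)) {
                continue; // skip known-bad partitions
            }
            try {
                cleaner.cleanPartition(tp);
            } catch (IOException e) {
                return true; // disk-level problem: escalate
            } catch (Exception e) {
                uncleanable.add(tp); // likely a logic bug: quarantine this partition only
            }
        }
        return false;
    }
}
```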
>>>>>>
>>>>>>> 4) The only improvement I can think of is that if such an error occurs, then have the option (configuration setting?) to create a <log_segment>.skip file (or something similar).
>>>>>> This is a good suggestion. Have others also seen corruption be generally tied to the same segment?
>>>>>>
>>>>>> On Wed, Jul 25, 2018 at 11:55 AM Dhruvil Shah <dhruvil@confluent.io> wrote:
>>>>>>> For the cleaner thread specifically, I do not think respawning will help at all because we are more than likely to run into the same issue again which would end up crashing the cleaner. Retrying makes sense for transient errors or when you believe some part of the system could have healed itself, both of which I think are not true for the log cleaner.
>>>>>>>
>>>>>>> On Wed, Jul 25, 2018 at 11:08 AM Ron Dagostino <rndgstn@gmail.com> wrote:
>>>>>>>> <<<respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
>>>>>>>> Hi Colin.  In case it could be relevant, one way to mitigate this effect is to implement a backoff mechanism (if a second respawn is to occur then wait for 1 minute before doing it; then if a third respawn is to occur wait for 2 minutes before doing it; then 4 minutes, 8 minutes, etc. up to some max wait time).
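[A minimal sketch of the capped exponential backoff Ron describes: 1 minute before a second respawn, doubling each time, up to some maximum. The class name, base delay, and 30-minute cap are illustrative assumptions, not broker code.]

```java
public class RespawnBackoff {
    private static final long BASE_MS = 60_000L;      // 1 minute before the 2nd respawn
    private static final long MAX_MS = 30 * 60_000L;  // assumed cap of 30 minutes

    /** Delay before the n-th spawn of the thread; the first spawn waits 0 ms. */
    public static long delayBeforeRespawnMs(int respawnCount) {
        if (respawnCount < 2) {
            return 0L; // first start (and anything before a re-spawn) is immediate
        }
        int exp = respawnCount - 2;      // 2nd respawn -> 1 min, 3rd -> 2 min, 4th -> 4 min...
        if (exp >= 10) {
            return MAX_MS;               // 2^10 minutes already far exceeds the cap
        }
        long delay = BASE_MS << exp;     // doubling schedule
        return Math.min(delay, MAX_MS);
    }
}
```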
>>>>>>>>
>>>>>>>> I have no opinion on whether respawn is appropriate or not in this context, but a mitigation like the increasing backoff described above may be relevant in weighing the pros and cons.
>>>>>>>>
>>>>>>>> Ron
>>>>>>>>
>>>>>>>> On Wed, Jul 25, 2018 at 1:26 PM Colin McCabe <cmccabe@apache.org> wrote:
>>>>>>>>>> On Mon, Jul 23, 2018, at 23:20, James Cheng wrote:
>>>>>>>>>> Hi Stanislav! Thanks for this KIP!
>>>>>>>>>>
>>>>>>>>>> I agree that it would be good if the LogCleaner were more tolerant of errors. Currently, as you said, once it dies, it stays dead.
>>>>>>>>>>
>>>>>>>>>> Things are better now than they used to be. We have the metric
>>>>>>>>>>        kafka.log:type=LogCleanerManager,name=time-since-last-run-ms
>>>>>>>>>> which we can use to tell us if the threads are dead. And as of 1.1.0, we have KIP-226, which allows you to restart the log cleaner thread, without requiring a broker restart.
>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-226+-+Dynamic+Broker+Configuration
>>>>>>>>>> I've only read about this, I haven't personally tried it.
>>>>>>>>> Thanks for pointing this out, James!  Stanislav, we should probably add a sentence or two mentioning the KIP-226 changes somewhere in the KIP.  Maybe in the intro section?
>>>>>>>>>
>>>>>>>>> I think it's clear that requiring the users to manually restart the log cleaner is not a very good solution.  But it's good to know that it's a possibility on some older releases.
>>>>>>>>>
>>>>>>>>>> Some comments:
>>>>>>>>>> * I like the idea of having the log cleaner continue to clean as many partitions as it can, skipping over the problematic ones if possible.
>>>>>>>>>> * If the log cleaner thread dies, I think it should automatically be revived. Your KIP attempts to do that by catching exceptions during execution, but I think we should go all the way and make sure that a new one gets created, if the thread ever dies.
>>>>>>>>> This is inconsistent with the way the rest of Kafka works.  We don't automatically re-create other threads in the broker if they terminate.  In general, if there is a serious bug in the code, respawning threads is likely to make things worse, by putting you in an infinite loop which consumes resources and fires off continuous log messages.
>>>>>>>>>
>>>>>>>>>> * It might be worth trying to re-clean the uncleanable partitions. I've seen cases where an uncleanable partition later became cleanable. I unfortunately don't remember how that happened, but I remember being surprised when I discovered it. It might have been something like a follower was uncleanable but after a leader election happened, the log truncated and it was then cleanable again. I'm not sure.
>>>>>>>>> James, I disagree.  We had this behavior in the Hadoop Distributed File System (HDFS) and it was a constant source of user problems.
>>>>>>>>>
>>>>>>>>> What would happen is disks would just go bad over time.  The DataNode would notice this and take them offline.  But then, due to some "optimistic" code, the DataNode would periodically try to re-add them to the system.  Then one of two things would happen: the disk would just fail immediately again, or it would appear to work and then fail after a short amount of time.
>>>>>>>>>
>>>>>>>>> The way the disk failed was normally having an I/O request take a really long time and time out.  So a bunch of request handler threads would basically slam into a brick wall when they tried to access the bad disk, slowing the DataNode to a crawl.  It was even worse in the second scenario, if the disk appeared to work for a while, but then failed.  Any data that had been written on that DataNode to that disk would be lost, and we would need to re-replicate it.
>>>>>>>>>
>>>>>>>>> Disks aren't biological systems-- they don't heal over time.  Once they're bad, they stay bad.  The log cleaner needs to be robust against cases where the disk really is failing, and really is returning bad data or timing out.
>>>>>>>>>> * For your metrics, can you spell out the full metric in JMX-style format, such as:
>>>>>>>>>>
>>>>>>>>>>   kafka.log:type=LogCleanerManager,name=uncleanable-partitions-count
>>>>>>>>>>                value=4
>>>>>>>>>>
>>>>>>>>>> * For "uncleanable-partitions": topic-partition names can be very long. I think the current max size is 210 characters (or maybe 240-ish?). Having "uncleanable-partitions" be a list could make for a very large metric. Also, having the metric come out as a csv might be difficult to work with for monitoring systems. If we *did* want the topic names to be accessible, what do you think of having the
>>>>>>>>>>        kafka.log:type=LogCleanerManager,topic=topic1,partition=2
>>>>>>>>>> I'm not sure if LogCleanerManager is the right type, but my example was that the topic and partition can be tags in the metric. That will allow monitoring systems to more easily slice and dice the metric. I'm not sure what the attribute for that metric would be. Maybe something like "uncleaned bytes" for that topic-partition? Or time-since-last-clean? Or maybe even just "Value=1".
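[To illustrate the tag-style naming James suggests, using only the standard javax.management API: topic and partition become key properties of the ObjectName, so monitoring systems can query by pattern instead of parsing a CSV attribute. The name below is James's hypothetical example, not an existing Kafka metric.]

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class TaggedMetricName {
    /**
     * Builds a per-partition metric name like
     * kafka.log:type=LogCleanerManager,topic=topic1,partition=2
     * Note: real topic names containing special characters would need
     * ObjectName.quote(); omitted here for brevity.
     */
    public static ObjectName uncleanablePartition(String topic, int partition) {
        try {
            return new ObjectName(
                "kafka.log:type=LogCleanerManager,topic=" + topic
                    + ",partition=" + partition);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```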
>>>>>>>>> I haven't thought about this that hard, but do we really need the uncleanable topic names to be accessible through a metric?  It seems like the admin should notice that uncleanable partitions are present, and then check the logs?
>>>>>>>>>
>>>>>>>>>> * About `max.uncleanable.partitions`, you said that this likely indicates that the disk is having problems. I'm not sure that is the case. For the 4 JIRAs that you mentioned about log cleaner problems, all of them are partition-level scenarios that happened during normal operation. None of them were indicative of disk problems.
>>>>>>>>> I don't think this is a meaningful comparison.  In general, we don't accept JIRAs for hard disk problems that happen on a particular cluster.  If someone opened a JIRA that said "my hard disk is having problems" we could close that as "not a Kafka bug."  This doesn't prove that disk problems don't happen, but just that JIRA isn't the right place for them.
>>>>>>>>> I do agree that the log cleaner has had a significant number of logic bugs, and that we need to be careful to limit their impact.  That's one reason why I think that a threshold of "number of uncleanable logs" is a good idea, rather than just failing after one IOException.  In all the cases I've seen where a user hit a logic bug in the log cleaner, it was just one partition that had the issue.  We also should increase test coverage for the log cleaner.
>>>>>>>>>
>>>>>>>>>> * About marking disks as offline when exceeding a certain threshold, that actually increases the blast radius of log compaction failures. Currently, the uncleaned partitions are still readable and writable. Taking the disks offline would impact availability of the uncleanable partitions, as well as impact all other partitions that are on the disk.
>>>>>>>>> In general, when we encounter I/O errors, we take the disk partition offline.  This is spelled out in KIP-112 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD):
>>>>>>>>>
>>>>>>>>>> - Broker assumes a log directory to be good after it starts, and mark log directory as bad once there is IOException when broker attempts to access (i.e. read or write) the log directory.
>>>>>>>>>> - Broker will be offline if all log directories are bad.
>>>>>>>>>> - Broker will stop serving replicas in any bad log directory. New replicas will only be created on good log directory.
>>>>>>>>> The behavior Stanislav is proposing for the log cleaner is actually more optimistic than what we do for regular broker I/O, since we will tolerate multiple IOExceptions, not just one.  But it's generally consistent.  Ignoring errors is not.  In any case, if you want to tolerate an unlimited number of I/O errors, you can just set the threshold to an infinite value (although I think that would be a bad idea).
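[A sketch of the thresholding being debated: each failure marks one partition of a log dir uncleanable, and only when the per-dir count exceeds the configured maximum is the whole dir taken offline. The class and method names are illustrative, not the KIP's actual implementation.]

```java
import java.util.HashMap;
import java.util.Map;

public class UncleanableTracker {
    private final int maxUncleanablePartitions; // e.g. the max.uncleanable.partitions config
    private final Map<String, Integer> uncleanableByDir = new HashMap<>();

    public UncleanableTracker(int maxUncleanablePartitions) {
        this.maxUncleanablePartitions = maxUncleanablePartitions;
    }

    /**
     * Record one newly uncleanable partition in logDir.
     * Returns true once the dir has accumulated more uncleanable
     * partitions than the threshold, i.e. it should go offline.
     */
    public boolean recordFailure(String logDir) {
        int count = uncleanableByDir.merge(logDir, 1, Integer::sum);
        return count > maxUncleanablePartitions;
    }
}
```

Setting the threshold very high approximates the "tolerate unlimited errors" extreme Colin mentions; setting it to 0 reproduces fail-on-first-error.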
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> Colin
>>>>>>>>>
>>>>>>>>>> -James
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jul 23, 2018, at 5:46 PM, Stanislav Kozlovski <stanislav@confluent.io> wrote:
>>>>>>>>>>> I renamed the KIP and that changed the link. Sorry about that. Here is the new link:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Improve+LogCleaner+behavior+on+error
>>>>>>>>>>> On Mon, Jul 23, 2018 at 5:11 PM Stanislav Kozlovski <stanislav@confluent.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey group,
>>>>>>>>>>>>
>>>>>>>>>>>> I created a new KIP about making log compaction more fault-tolerant.
>>>>>>>>>>>> Please give it a look here and please share what you think, especially in regards to the points in the "Needs Discussion" paragraph.
>>>>>>>>>>>>
>>>>>>>>>>>> KIP: KIP-346 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-346+-+Limit+blast+radius+of+log+compaction+failure>
>>>>>>>>>>>> --
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stanislav
>>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best,
>>>>>>>>>>> Stanislav
>>>>>
>>>> --
>>>> Best,
>>>> Stanislav
>>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>

