kafka-jira mailing list archives

From "Ismael Juma (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
Date Wed, 27 Sep 2017 16:26:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182839#comment-16182839 ]

Ismael Juma edited comment on KAFKA-5973 at 9/27/17 4:25 PM:
-------------------------------------------------------------

[~theduderog], hmm, I don't understand why. If there is a metric, operators can simply kill
the broker themselves if that's what they want, right?


was (Author: ijuma):
[~theduderog], hmm, I don't understand why. If there is a metric, operations can simply kill
the broker themselves if that's what they want, right?

> ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5973
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5973
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Tom Crayford
>             Fix For: 0.11.0.2, 1.0.1
>
>         Attachments: 5973.v1.txt
>
>
> When any Kafka broker {{ShutdownableThread}} subclass crashes due to an
> uncaught exception, the broker is left running in a very weird/bad state: some
> threads are not running, yet the broker may still be serving traffic to
> users without performing its usual operations.
> This is problematic because monitoring may say that "the broker is up and fine"
> when in fact it is not healthy.
> At Heroku we've been mitigating this by monitoring all threads that "should" be
> running on a broker and alerting when a given thread isn't running for some
> reason.
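
A minimal sketch of that kind of check, assuming it runs inside the JVM being monitored. The class name `ThreadLivenessCheck` and the thread-name prefixes in the usage note are hypothetical and are not Kafka's real thread names; a real list would have to be derived from the broker version in use.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Kafka.
final class ThreadLivenessCheck {
    private ThreadLivenessCheck() {}

    /**
     * Returns the expected thread-name prefixes for which no live thread
     * exists in this JVM, in the order they were given.
     */
    static Set<String> missingThreads(List<String> expectedPrefixes) {
        Set<String> missing = new LinkedHashSet<>(expectedPrefixes);
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            missing.removeIf(prefix -> t.getName().startsWith(prefix));
        }
        return missing;
    }
}
```

An alerting job could call `ThreadLivenessCheck.missingThreads(...)` periodically with prefixes such as "demo-fetcher" and "demo-cleaner" and raise an alert whenever the returned set is non-empty.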
> Things that use {{ShutdownableThread}} that can crash and leave a broker/the controller
> in a bad state:
> - log cleaner
> - replica fetcher threads
> - controller to broker send threads
> - controller topic deletion threads
> - quota throttling reapers
> - io threads
> - network threads
> - group metadata management threads
> Some of these can have disastrous consequences, and a crash of nearly any of them,
> for any reason, is cause for an alert.
> But users probably shouldn't have to know about all the internals of Kafka and run thread
> dumps periodically as part of normal operations.
> There are a few potential options here:
> 1. On the crash of any {{ShutdownableThread}}, shut down the whole broker process
> We could crash the whole broker when an individual thread dies. I think this is pretty
> reasonable: it's better to have a very visible breakage than a very hard-to-detect one.
> 2. Add some healthcheck JMX bean to detect these thread crashes
> Users having to audit all of Kafka's source code on each new release and track a list
> of "threads that should be running" is... pretty silly. We could instead expose a JMX bean
> of some kind indicating threads that died due to uncaught exceptions.
> 3. Do nothing, but add documentation around monitoring/logging that exposes this error
> These thread deaths *do* emit log lines, but it's not that clear or obvious to users
> that they need to monitor and alert on them. The project could add documentation
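
Options 1 and 2 above can be sketched together in plain Java. This is only an illustration under stated assumptions, not Kafka's implementation or the eventual fix: `DeadThreadTracker`, its nested `DeadThreadsMXBean` interface, and the `haltOnDeath` flag are hypothetical names. An uncaught-exception handler also only sees exceptions that escape `run()`, so a thread class that catches `Throwable` internally would have to call the tracker itself.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.JMException;
import javax.management.ObjectName;

// Hypothetical tracker, not part of Kafka.
final class DeadThreadTracker
        implements DeadThreadTracker.DeadThreadsMXBean, Thread.UncaughtExceptionHandler {

    // JMX treats an interface whose name ends in "MXBean" as an MXBean interface.
    public interface DeadThreadsMXBean {
        int getDeadThreadCount();
        List<String> getDeadThreadNames();
    }

    private final List<String> dead = new CopyOnWriteArrayList<>();
    private final boolean haltOnDeath;

    DeadThreadTracker(boolean haltOnDeath) {
        this.haltOnDeath = haltOnDeath;
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // Option 2: remember the dead thread so monitoring can read it over JMX.
        dead.add(t.getName());
        System.err.println("Thread " + t.getName() + " died: " + e);
        if (haltOnDeath) {
            // Option 1: fail loudly. halt() skips shutdown hooks;
            // System.exit(1) would run them first.
            Runtime.getRuntime().halt(1);
        }
    }

    @Override
    public int getDeadThreadCount() {
        return dead.size();
    }

    @Override
    public List<String> getDeadThreadNames() {
        return new ArrayList<>(dead);
    }

    // Registers this tracker with the platform MBean server under the given name.
    void register(String objectName) throws JMException {
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(this, new ObjectName(objectName));
    }
}
```

A thread would opt in with `thread.setUncaughtExceptionHandler(tracker)`. With `haltOnDeath` set to false, an operator can alert on a non-zero `DeadThreadCount` and decide whether to restart the broker, which is the approach suggested in the comment at the top of this message.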



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
