kafka-jira mailing list archives

From "Ismael Juma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5973) ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
Date Wed, 27 Sep 2017 15:08:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16182705#comment-16182705 ]

Ismael Juma commented on KAFKA-5973:
------------------------------------

Because Kafka brokers are stateful, you may not want to kill a broker when one of its threads
dies. It may be better to trigger an alert via a metric and let the Ops team decide how they
would like to handle it. In some cases, you may want to run some additional diagnostics while
the broker is still running. Also, imagine a situation where a software bug causes one thread
to die in multiple brokers. This could be a somewhat harmless situation, but if each broker
immediately shuts itself down, you may have a serious outage.
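
To make the "alert via a metric" idea concrete, here is a minimal sketch in plain Java. It is not Kafka code: the class name ThreadDeathCounter and its method deadThreadCount() are hypothetical, and it assumes the exception actually escapes the thread's run() method. As the issue title suggests, ShutdownableThread catches errors itself, so a real change would probably have to record the failure at that point instead.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not part of Kafka: counts threads that died from an
// uncaught exception so the count can be exported as a metric and alerted on.
final class ThreadDeathCounter implements Thread.UncaughtExceptionHandler {
    private final AtomicLong deadThreads = new AtomicLong();

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        // Record and log the failure; deliberately do NOT exit the process,
        // so operators can still inspect the running broker.
        deadThreads.incrementAndGet();
        System.err.println("Thread " + t.getName() + " died: " + e);
    }

    // Value a metrics reporter could poll and alert on when it is non-zero.
    long deadThreadCount() {
        return deadThreads.get();
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadDeathCounter counter = new ThreadDeathCounter();
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated bug");
        }, "demo-worker");
        worker.setUncaughtExceptionHandler(counter);
        worker.start();
        worker.join(); // the handler runs on the dying thread before join() returns
        System.out.println("dead threads: " + counter.deadThreadCount()); // prints "dead threads: 1"
    }
}
```

The process keeps running after the worker dies, which is the trade-off argued for above: the failure becomes visible through the counter, and the decision to restart stays with the operators.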

> ShutdownableThread catching errors can lead to partial hard to diagnose broker failure
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-5973
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5973
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Tom Crayford
>            Priority: Minor
>             Fix For: 1.0.0, 0.11.0.2
>
>         Attachments: 5973.v1.txt
>
>
> When any Kafka broker {{ShutdownableThread}} subclass crashes due to an
> uncaught exception, the broker is left running in a bad state: some threads
> are no longer running, yet the broker may still be serving traffic to
> users without performing its usual operations.
> This is problematic, because monitoring may say that "the broker is up and fine", but
> in fact it is not healthy.
> At Heroku we've been mitigating this by monitoring all threads that "should" be
> running on a broker and alerting when a given thread isn't running for some
> reason.
> Things that use {{ShutdownableThread}} that can crash and leave a broker/the controller
> in a bad state:
> - log cleaner
> - replica fetcher threads
> - controller to broker send threads
> - controller topic deletion threads
> - quota throttling reapers
> - io threads
> - network threads
> - group metadata management threads
> Some of these can have disastrous consequences, and nearly all of them crashing for
> any reason is a cause for alert.
> But users probably shouldn't have to know about all the internals of Kafka and run thread
> dumps periodically as part of normal operations.
> There are a few potential options here:
> 1. On the crash of any {{ShutdownableThread}}, shut down the whole broker process
> We could crash the whole broker when an individual thread dies. I think this is pretty
> reasonable; it's better to have a very visible breakage than a very hard to detect one.
> 2. Add some healthcheck JMX bean to detect these thread crashes
> Having users audit all of Kafka's source code on each new release and track a list
> of "threads that should be running" is... pretty silly. We could instead expose a JMX bean
> of some kind indicating threads that died due to uncaught exceptions.
> 3. Do nothing, but add documentation around monitoring/logging that exposes this error
> These thread deaths *do* emit log lines, but it's not that clear or obvious to users
> that they need to monitor and alert on them. The project could add documentation
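
Option 2 in the description above could look roughly like the following sketch. It is only an illustration under stated assumptions: DeadThreadJmx, DeadThreadsMXBean, Registry, and the object name "example.sketch:type=DeadThreads" are all made up for this example and are not existing Kafka classes or metric names. It uses only the standard JMX API to expose the names and count of threads that died from an uncaught exception.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import javax.management.JMException;
import javax.management.ObjectName;

// Hypothetical sketch, not Kafka code.
final class DeadThreadJmx {

    // JMX requires the management interface to be public; the "MXBean"
    // suffix marks it as an MXBean.
    public interface DeadThreadsMXBean {
        int getDeadThreadCount();
        List<String> getDeadThreadNames();
    }

    static final class Registry implements DeadThreadsMXBean, Thread.UncaughtExceptionHandler {
        private final List<String> names = new CopyOnWriteArrayList<>();

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            names.add(t.getName()); // remember which thread died
        }

        @Override
        public int getDeadThreadCount() {
            return names.size();
        }

        @Override
        public List<String> getDeadThreadNames() {
            return new ArrayList<>(names); // defensive copy for JMX clients
        }
    }

    // Registers a new Registry with the platform MBean server under the given name.
    static Registry register(String objectName) throws JMException {
        Registry registry = new Registry();
        ManagementFactory.getPlatformMBeanServer()
                .registerMBean(registry, new ObjectName(objectName));
        return registry;
    }

    public static void main(String[] args) throws Exception {
        Registry registry = register("example.sketch:type=DeadThreads");
        Thread worker = new Thread(() -> {
            throw new IllegalStateException("simulated bug");
        }, "demo-log-cleaner");
        worker.setUncaughtExceptionHandler(registry);
        worker.start();
        worker.join();
        System.out.println(registry.getDeadThreadNames());
    }
}
```

A monitoring system could then alert whenever the DeadThreadCount attribute is greater than zero, without users having to know which threads a given release is supposed to run.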



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
