kafka-dev mailing list archives

From "Zhanxiang Huang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-6846) Controller can spend long time in shutting down RequestSendThread when processing BrokerChange event
Date Wed, 02 May 2018 07:21:00 GMT
Zhanxiang Huang created KAFKA-6846:

             Summary: Controller can spend long time in shutting down RequestSendThread when
processing BrokerChange event
                 Key: KAFKA-6846
                 URL: https://issues.apache.org/jira/browse/KAFKA-6846
             Project: Kafka
          Issue Type: Bug
          Components: controller
            Reporter: Zhanxiang Huang

The controller can spend a long time (more than 60s) processing a BrokerChange event when there
are dead brokers. For example, we saw entries like these in the controller log:

2018/04/28 18:13:50.021 [KafkaController] [Controller 7586]: Newly added brokers: , deleted
brokers: 5222, bounced Brokers: , all live brokers: 3238,3322,5134,5177,5213,5214,5217,5218,5219,5220,5221,5319,5652,5949,7569,7574,7577,7581,7586,7589,7594,7595,7601,7609,14838,14840,14848,14855,14882,14886,14889,14901,16033
2018/04/28 18:13:50.021 [RequestSendThread] [Controller-7586-to-broker-5222-send-thread]:
Shutting down
2018/04/28 18:14:49.196 [RequestSendThread] [Controller-7586-to-broker-5222-send-thread]:
Shutdown completed
2018/04/28 18:14:49.196 [RequestSendThread] [Controller-7586-to-broker-5222-send-thread]:
2018/04/28 18:14:49.200 [KafkaController] [Controller 7586]: Broker failure callback for 5222

This shows that 59s elapsed between the RequestSendThread shutdown being initiated (18:13:50)
and the shutdown completing (18:14:49).

The root cause is that RequestSendThread calls NetworkClient.poll() in a while loop in
NetworkClientUtils.awaitReady() and NetworkClientUtils.sendAndReceive() without checking
the interrupt flag. As a result, the interrupt triggered by the controller thread breaks out
of poll() only once, and the RequestSendThread then blocks in the next poll() until it receives
the disconnected message or times out; only then can it finish the shutdown. During this
period, the controller event thread is blocked waiting on the shutdownComplete latch, which
is bad because there is only a single controller event thread.

This issue can be resolved by making the thread throw InterruptedException right after each
poll() call in awaitReady() and sendAndReceive() if it sees that the interrupt flag has been
set. I will create a PR for that.
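
As a rough illustration of the proposed check (not the actual patch), the sketch below polls in a loop and throws InterruptedException as soon as a poll returns with the interrupt flag set. The Poller interface, the class name, and the simulated never-ready poller are hypothetical stand-ins introduced for this example; they are not Kafka's NetworkClient or NetworkClientUtils APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class InterruptAwarePollSketch {

    /** Hypothetical stand-in for a network client; NOT Kafka's NetworkClient API. */
    public interface Poller {
        /** Blocks for up to timeoutMs; returns true once the awaited condition holds. */
        boolean poll(long timeoutMs);
    }

    /**
     * Polls until the poller reports ready or the timeout elapses.
     * After every poll, the interrupt flag is checked so that a shutdown
     * request is honored instead of falling into another blocking poll.
     */
    public static boolean awaitReady(Poller poller, long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        long remaining = timeoutMs;
        while (remaining > 0) {
            boolean ready = poller.poll(remaining);
            // Thread.interrupted() reads and clears the flag, which is the usual
            // convention right before throwing InterruptedException.
            if (Thread.interrupted())
                throw new InterruptedException("interrupted while polling");
            if (ready)
                return true;
            remaining = deadline - System.currentTimeMillis();
        }
        return false;
    }

    /** Interrupts a worker stuck in awaitReady and reports whether it stopped promptly. */
    public static boolean interruptStopsLoop() throws InterruptedException {
        AtomicBoolean sawInterrupt = new AtomicBoolean(false);
        CountDownLatch started = new CountDownLatch(1);

        // Simulated poller (assumption): never becomes ready, wakes up early when
        // interrupted, and leaves the interrupt flag set for the caller to inspect.
        Poller neverReady = timeoutMs -> {
            started.countDown();
            try {
                Thread.sleep(Math.min(timeoutMs, 1000));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return false;
        };

        Thread worker = new Thread(() -> {
            try {
                awaitReady(neverReady, 60_000);
            } catch (InterruptedException e) {
                sawInterrupt.set(true);
            }
        });
        worker.start();
        started.await();      // wait until the worker is inside its first poll
        worker.interrupt();   // analogous to the controller initiating shutdown
        worker.join(5_000);   // far less than the 60s poll budget
        return sawInterrupt.get() && !worker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("worker stopped on interrupt: " + interruptStopsLoop());
    }
}
```

Without the flag check after poll(), the worker in this sketch would go back into the loop and keep polling until the 60s budget ran out, which mirrors the delay seen in the log above.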


