kafka-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Onur Karaman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-4959) remove controller concurrent access to non-threadsafe NetworkClient, Selector, and SSLEngine
Date Mon, 27 Mar 2017 22:07:41 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Onur Karaman updated KAFKA-4959:
    Status: Patch Available  (was: In Progress)

> remove controller concurrent access to non-threadsafe NetworkClient, Selector, and SSLEngine
> --------------------------------------------------------------------------------------------
>                 Key: KAFKA-4959
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4959
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Onur Karaman
>            Assignee: Onur Karaman
> This brought down a cluster by causing continuous controller moves.
> ZkClient's ZkEventThread and a RequestSendThread can concurrently use objects that aren't
> * Selector
> * NetworkClient
> * SSLEngine (this was the big one for us. We turn on SSL for interbroker communication).
> As per the "Concurrency Notes" section from the [SSLEngine javadoc|https://docs.oracle.com/javase/7/docs/api/javax/net/ssl/SSLEngine.html]:
> bq. two threads must not attempt to call the same method (either wrap() or unwrap())
> SSLEngine.wrap gets called in:
> * SslTransportLayer.write
> * SslTransportLayer.handshake
> * SslTransportLayer.close
> It turns out that the ZkEventThread and RequestSendThread can concurrently call SSLEngine.wrap:
> * ZkEventThread calls SslTransportLayer.close from ControllerChannelManager.removeExistingBroker
> * RequestSendThread can call SslTransportLayer.write or SslTransportLayer.handshake from
> Suppose the controller moves for whatever reason. The former controller could have had
a RequestSendThread who was in the middle of sending out messages to the cluster while the
ZkEventThread began executing KafkaController.onControllerResignation, which calls ControllerChannelManager.shutdown,
which sequentially cleans up the controller-to-broker queue and connection for every broker
in the cluster. This cleanup includes the call to ControllerChannelManager.removeExistingBroker
as mentioned earlier, causing the concurrent call to SSLEngine.wrap. This concurrent call
throws a BufferOverflowException which ControllerChannelManager.removeExistingBroker catches
so the ControllerChannelManager.shutdown moves onto cleaning up the next controller-to-broker
queue and connection, skipping the cleanup steps such as clearing the queue, stopping the
RequestSendThread, and removing the entry from its brokerStateInfo map.
> By failing out of the Selector.close, the sensors corresponding to the broker connection
has not been cleaned up. Any later attempt at initializing an identical Selector will result
in a sensor collision and therefore cause Selector initialization to throw an exception. In
other words, any later attempts by this broker to become controller again will fail on initialization.
When controller initialization fails, the controller deletes the /controller znode and lets
another broker take over.
> Now suppose the controller moves enough times such that every broker hits the BufferOverflowException
concurrency issue. We're now guaranteed to fail controller initialization due to the sensor
collision on every controller transition, so the controller will move across brokers continuously.

This message was sent by Atlassian JIRA

View raw message