kafka-jira mailing list archives

From "Kyle Ambroff-Kao (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-6469) ISR change notification queue can prevent controller from making progress
Date Tue, 23 Jan 2018 04:06:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle Ambroff-Kao updated KAFKA-6469:
    Summary: ISR change notification queue can prevent controller from making progress  (was:
ISR change notification queue has a maximum size)

> ISR change notification queue can prevent controller from making progress
> -------------------------------------------------------------------------
>                 Key: KAFKA-6469
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6469
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Kyle Ambroff-Kao
>            Priority: Major
> When writes to /isr_change_notification in ZooKeeper (which is effectively a queue
of ISR change events for the controller) happen at a rate higher than the watching node
can dequeue them, the trouble starts.
> The watcher kafka.controller.IsrChangeNotificationListener is fired in the controller
when a new entry is written to /isr_change_notification, and the zkclient library sends a
GetChildrenRequest to ZooKeeper to fetch all child znodes. The size of the GetChildrenResponse
returned by ZooKeeper is the problem. Reading through the code and running some tests to confirm
shows that an empty GetChildrenResponse is 4 bytes on the wire, and every child node name adds
a minimum of 4 bytes of overhead as well. Since these znode names are 21 characters long, every
child znode accounts for 25 bytes in the response.
> A GetChildrenResponse with 42k child nodes of the same length will be just about 1.001MB,
which is larger than the 1MB data frame that ZooKeeper uses. This causes the ZooKeeper server
to drop the broker's session.
> So if 42k ISR changes happen at once, and the controller pauses at just the right time,
you'll end up with a queue that can no longer be drained.
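The arithmetic above can be checked with a short sketch. The constants here (4-byte empty response, 4-byte per-name overhead, 21-character znode names, 1MB frame limit) are taken from the measurements described in this report, not derived from the ZooKeeper source:

```python
# Back-of-the-envelope check of the GetChildrenResponse size described above.
EMPTY_RESPONSE_BYTES = 4       # measured size of an empty GetChildrenResponse
PER_CHILD_OVERHEAD = 4         # minimum per-child-name overhead on the wire
ZNODE_NAME_LEN = 21            # e.g. "isr_change_0000000042"
JUTE_MAX_BUFFER = 1024 * 1024  # ZooKeeper's default 1MB frame limit (jute.maxbuffer)

def response_size(num_children: int) -> int:
    """Approximate wire size of a GetChildrenResponse with num_children entries."""
    return EMPTY_RESPONSE_BYTES + num_children * (PER_CHILD_OVERHEAD + ZNODE_NAME_LEN)

size = response_size(42_000)
print(size, size > JUTE_MAX_BUFFER)  # 1050004 True -- just over the 1MB frame
```

At 42k children the response is roughly 1.001MB, exceeding the frame limit, so ZooKeeper drops the session rather than answering the request.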
> We've seen this happen in one of our test clusters as the partition count started to
climb north of 60k per broker. We had a hardware failure that led to the cluster writing
so many child nodes to /isr_change_notification that the controller could no longer list its
children, effectively bricking the cluster.
> This can be partially mitigated by chunking ISR change notifications, which would raise
the maximum number of partitions a broker can safely host.
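One way to picture the chunking mitigation is a sketch like the following. Everything here (the chunk size, the JSON payload layout, the function name) is an illustrative assumption, not Kafka's actual implementation: the idea is simply to batch many changed partitions into each sequential notification znode so the child count stays bounded:

```python
import json

CHUNK_SIZE = 5000  # assumed cap on partitions per notification znode

def chunk_isr_changes(changed_partitions):
    """Yield one JSON payload per notification znode to be created,
    batching up to CHUNK_SIZE changed partitions into each."""
    for i in range(0, len(changed_partitions), CHUNK_SIZE):
        chunk = changed_partitions[i:i + CHUNK_SIZE]
        yield json.dumps({"version": 1, "partitions": chunk})

# 60k changed partitions now produce 12 child znodes instead of 60,000,
# keeping the GetChildrenResponse far below ZooKeeper's 1MB frame limit.
payloads = list(chunk_isr_changes([f"topic-{p}" for p in range(60_000)]))
print(len(payloads))  # 12
```

With 25 bytes per child name, even a few dozen notification znodes keep the GetChildrenResponse in the kilobyte range, so the controller can always list and drain the queue.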

This message was sent by Atlassian JIRA
