kafka-jira mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-6630) Speed up the processing of TopicDeletionStopReplicaResponseReceived events on the controller
Date Thu, 08 Mar 2018 22:59:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-6630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392070#comment-16392070 ]

ASF GitHub Bot commented on KAFKA-6630:

gitlw opened a new pull request #4668: KAFKA-6630: Speed up the processing of TopicDeletionStopReplicaResponseReceived events on the controller
URL: https://github.com/apache/kafka/pull/4668
   This patch speeds up the inefficient functions identified in KAFKA-6630 by grouping the partitions in the ControllerContext.partitionReplicaAssignment variable by topic. As a result, finding all replicas of a topic no longer requires iterating over every replica in the cluster.
   All tests pass with "gradle testAll".
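A minimal sketch of the grouping idea, written in Python for brevity (the controller itself is Scala). It assumes partitions are identified by (topic, partition) pairs and replicas by broker IDs; the class `TopicGroupedAssignment` and its methods are hypothetical illustrations, not the actual `ControllerContext` API from the patch.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical representation of a partition: (topic, partition index).
TopicPartition = Tuple[str, int]


class TopicGroupedAssignment:
    """Hypothetical replica assignment grouped by topic.

    Instead of one flat map {(topic, partition): [broker ids]}, this keeps
    {topic: {partition: [broker ids]}}, so per-topic lookups only touch
    the partitions of that topic.
    """

    def __init__(self) -> None:
        self._by_topic: Dict[str, Dict[int, List[int]]] = defaultdict(dict)

    def update(self, tp: TopicPartition, replicas: List[int]) -> None:
        topic, partition = tp
        self._by_topic[topic][partition] = list(replicas)

    def remove_topic(self, topic: str) -> None:
        self._by_topic.pop(topic, None)

    def replicas_for_topic(self, topic: str) -> Set[Tuple[str, int, int]]:
        # Cost is proportional to this topic's replicas only, not to the
        # total number of replicas in the cluster. .get() avoids creating
        # an empty entry for unknown topics.
        partitions = self._by_topic.get(topic, {})
        return {
            (topic, partition, broker)
            for partition, brokers in partitions.items()
            for broker in brokers
        }


assignment = TopicGroupedAssignment()
assignment.update(("orders", 0), [1, 2])
assignment.update(("orders", 1), [2, 3])
assignment.update(("clicks", 0), [1])
print(sorted(assignment.replicas_for_topic("orders")))
# [('orders', 0, 1), ('orders', 0, 2), ('orders', 1, 2), ('orders', 1, 3)]
```

The trade-off is a second level of map lookups on every update in exchange for per-topic reads that no longer scale with cluster size.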
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> Speed up the processing of TopicDeletionStopReplicaResponseReceived events on the controller
> --------------------------------------------------------------------------------------------
>                 Key: KAFKA-6630
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6630
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>            Reporter: Lucas Wang
>            Assignee: Lucas Wang
>            Priority: Minor
> Problem Statement:
> In a large cluster with many partition replicas, we found that successfully deleting a topic takes a long time.
> Root cause:
> Further analysis shows that, for a topic with N replicas, the controller receives all N StopReplicaResponses from the brokers within a short time. However, handling the resulting N TopicDeletionStopReplicaResponseReceived events sequentially, one by one, takes a long time.
> Specifically, handling each TopicDeletionStopReplicaResponseReceived event triggers the following call chain:
> TopicDeletionStopReplicaResponseReceived.process calls TopicDeletionManager.completeReplicaDeletion, which calls TopicDeletionManager.resumeDeletions, which in turn calls several inefficient functions.
> The inefficient functions called inside TopicDeletionManager.resumeDeletions are:
> ReplicaStateMachine.areAllReplicasForTopicDeleted
> ReplicaStateMachine.isAtLeastOneReplicaInDeletionStartedState
> ReplicaStateMachine.replicasInState
> Each of these three functions iterates over every replica in the cluster and keeps only the replicas that belong to the topic in question. In a large cluster with many replicas, these functions can therefore be quite slow.
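To illustrate why these calls are slow, here is a minimal Python sketch of the pattern described above: each check scans one cluster-wide replica-state map and keeps only the entries for the given topic. The state names are modeled on Kafka's replica states, but the dictionary layout, the `Replica` tuple, and the snake_case function names are hypothetical stand-ins for the ReplicaStateMachine methods, not their real signatures or implementations.

```python
from enum import Enum
from typing import Dict, Set, Tuple

# Hypothetical replica key: (topic, partition, broker_id).
Replica = Tuple[str, int, int]


class ReplicaState(Enum):
    # Names modeled on Kafka's replica states; only a subset is shown.
    ONLINE = "OnlineReplica"
    DELETION_STARTED = "ReplicaDeletionStarted"
    DELETION_SUCCESSFUL = "ReplicaDeletionSuccessful"


def replicas_in_state(states: Dict[Replica, ReplicaState], topic: str,
                      state: ReplicaState) -> Set[Replica]:
    # O(R): visits every replica in the cluster to find one topic's replicas.
    return {r for r, s in states.items() if r[0] == topic and s == state}


def is_at_least_one_replica_in_deletion_started_state(
        states: Dict[Replica, ReplicaState], topic: str) -> bool:
    # O(R) in the worst case, i.e. when no replica of the topic matches.
    return any(r[0] == topic and s == ReplicaState.DELETION_STARTED
               for r, s in states.items())


def are_all_replicas_for_topic_deleted(
        states: Dict[Replica, ReplicaState], topic: str) -> bool:
    # O(R): filters the whole cluster-wide map down to one topic first.
    topic_states = [s for r, s in states.items() if r[0] == topic]
    return all(s == ReplicaState.DELETION_SUCCESSFUL for s in topic_states)
```

Every call pays for the whole cluster, even when the topic being deleted has only a handful of replicas.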
> Total deletion time for a topic becomes long because controller processing is single-threaded:
> Since the controller must process the queued TopicDeletionStopReplicaResponseReceived events sequentially, if processing one event takes time t, processing the events for all N replicas of a topic takes N * t. Because t itself grows with the total number of replicas in the cluster, topic deletion in a large cluster is slow.
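The N * t estimate can be made concrete with a rough cost model. The sketch below counts replica visits, assuming each of the N events runs a fixed number k of full scans (k = 3, one per function listed earlier) over a cluster of R replicas, and compares that with scanning only the topic's own N replicas once replicas are grouped by topic. The function names, the constant k, and the example sizes are illustrative assumptions, not measurements; real per-event cost includes other work as well.

```python
def flat_scan_visits(n_topic_replicas: int, cluster_replicas: int,
                     scans_per_event: int = 3) -> int:
    # N events, each doing scans_per_event passes over all R cluster replicas.
    return n_topic_replicas * scans_per_event * cluster_replicas


def grouped_scan_visits(n_topic_replicas: int, scans_per_event: int = 3) -> int:
    # With replicas grouped by topic, each pass only touches the topic's N replicas.
    return n_topic_replicas * scans_per_event * n_topic_replicas


# Illustrative sizes: a topic with 300 replicas in a cluster of 200,000 replicas.
print(flat_scan_visits(300, 200_000))   # 180000000
print(grouped_scan_visits(300))         # 270000
```

Under these assumptions the flat layout costs on the order of N * R visits per topic deletion, while the grouped layout costs on the order of N * N, a reduction of roughly R / N.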

This message was sent by Atlassian JIRA
