kafka-jira mailing list archives

From "Lucas Wang (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-6753) Speed up event processing on the controller
Date Fri, 06 Apr 2018 00:18:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-6753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Lucas Wang updated KAFKA-6753:
    Attachment: Screen Shot 2018-04-04 at 7.08.55 PM.png

> Speed up event processing on the controller 
> --------------------------------------------
>                 Key: KAFKA-6753
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6753
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Lucas Wang
>            Assignee: Lucas Wang
>            Priority: Minor
>         Attachments: Screen Shot 2018-04-04 at 7.08.55 PM.png
> The existing controller code updates metrics after processing every event. This can slow
down event processing on the controller tremendously. In one profiling run, we saw that updating
metrics took nearly 100% of the CPU time of the controller event processing thread. Specifically,
the slowness can be attributed to two factors:
> 1. Each invocation to update the metrics is expensive. Specifically, calculating
the offline partitions count requires iterating through all the partitions in the cluster
to check whether each partition is offline, and calculating the preferred replica imbalance count
requires iterating through all the partitions in the cluster to check whether a partition has a
leader other than the preferred leader. In a large cluster, the number of partitions can be
quite large, and all of them are seen by the controller. Even if the time spent checking a single
partition is small, the cumulative effect of so many partitions can make one invocation
to update metrics quite expensive. One might argue that the logic for processing each
single partition is not optimized. However, we checked the CPU percentage of leaf nodes in the
profiling result and found that inside the loops over collection objects, e.g. the set of all
partitions, no single function dominates the processing. Hence the large number of partitions in a
cluster is the main contributor to the slowness of one invocation to update the metrics.
> 2. The update is invoked many times when there is a high number of
events to be processed by the controller: once after processing each event.
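
To make factor 1 concrete, here is a minimal sketch of what a full-scan metric computation looks like. It is not the controller's actual code: `MetricsScanSketch`, `PartitionState`, and both method names are hypothetical, and the sketch assumes the preferred leader is the first replica in a partition's assignment. The point is only that each call walks every partition, so the cost of one update grows with the cluster's partition count.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; these types are illustrative, not Kafka's classes.
class MetricsScanSketch {

    // Minimal per-partition state assumed for this example.
    static final class PartitionState {
        final int leader;              // current leader broker id, -1 if none
        final List<Integer> replicas;  // assigned replicas, preferred leader first (assumption)
        final boolean online;

        PartitionState(int leader, List<Integer> replicas, boolean online) {
            this.leader = leader;
            this.replicas = replicas;
            this.online = online;
        }
    }

    // Full scan: visits every partition to count the offline ones.
    static int offlinePartitionsCount(Map<String, PartitionState> partitions) {
        int count = 0;
        for (PartitionState p : partitions.values()) {
            if (!p.online) {
                count++;
            }
        }
        return count;
    }

    // Full scan: visits every partition to count those whose leader
    // is not the preferred (first assigned) replica.
    static int preferredReplicaImbalanceCount(Map<String, PartitionState> partitions) {
        int count = 0;
        for (PartitionState p : partitions.values()) {
            if (!p.replicas.isEmpty() && p.leader != p.replicas.get(0)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, PartitionState> partitions = new HashMap<>();
        partitions.put("topic-0", new PartitionState(1, Arrays.asList(1, 2), true));
        partitions.put("topic-1", new PartitionState(2, Arrays.asList(1, 2), true));
        System.out.println("offline: " + offlinePartitionsCount(partitions));
        System.out.println("imbalanced: " + preferredReplicaImbalanceCount(partitions));
    }
}
```

If both scans run after every event, the total work is roughly (number of events) x (number of partitions), which is how the two factors in this list multiply.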
> The patch that will be submitted tries to fix bullet 2 above, i.e. to reduce the number
of invocations to update metrics. Instead of updating the metrics after processing every event,
we only periodically check whether the metrics need to be updated, i.e. once every second.
> * If, after the previous invocation to update metrics, there have been other types of events
that changed the controller’s state, then one second later the metrics will be updated.

> * If, after the previous invocation, there have been no other types of events, then the
call to update metrics can be bypassed.
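
The periodic check described above can be sketched as a dirty flag plus a time-based throttle. This is an illustration under stated assumptions, not the submitted patch: `ThrottledMetricsUpdater`, `markStateChanged`, and `maybeUpdate` are hypothetical names, the class is assumed to be used from a single event-processing thread, and the clock is passed in explicitly so the behavior is easy to follow.

```java
// Hypothetical sketch of "check once per interval, update only if state changed".
class ThrottledMetricsUpdater {
    private final long intervalMs;        // e.g. 1000 for "once every second"
    private final Runnable updateMetrics; // the expensive full-scan update
    private boolean stateChanged = false; // set by events that change controller state
    private long lastCheckMs;

    ThrottledMetricsUpdater(long intervalMs, long nowMs, Runnable updateMetrics) {
        this.intervalMs = intervalMs;
        this.lastCheckMs = nowMs;
        this.updateMetrics = updateMetrics;
    }

    // Called after processing an event that changed the controller's state.
    void markStateChanged() {
        stateChanged = true;
    }

    // Cheap check. Runs the expensive update at most once per interval,
    // and only if some state-changing event happened since the last update.
    // Returns true if the update actually ran.
    boolean maybeUpdate(long nowMs) {
        if (nowMs - lastCheckMs < intervalMs) {
            return false; // interval not yet elapsed
        }
        lastCheckMs = nowMs;
        if (!stateChanged) {
            return false; // nothing changed: bypass the update
        }
        stateChanged = false;
        updateMetrics.run();
        return true;
    }

    public static void main(String[] args) {
        int[] updates = {0};
        ThrottledMetricsUpdater updater =
            new ThrottledMetricsUpdater(1000, 0, () -> updates[0]++);
        // Simulate one state-changing event per millisecond for three seconds.
        for (long t = 0; t < 3000; t++) {
            updater.markStateChanged();
            updater.maybeUpdate(t);
        }
        System.out.println("events: 3000, metric updates: " + updates[0]);
    }
}
```

With this shape, the number of expensive updates is bounded by elapsed time rather than by the number of events, at the cost of metrics being up to one interval stale.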

This message was sent by Atlassian JIRA