kafka-dev mailing list archives

From "Jiao Zhang (Jira)" <j...@apache.org>
Subject [jira] [Created] (KAFKA-9616) Add new metrics to get total response time with throttle time subtracted
Date Thu, 27 Feb 2020 07:12:00 GMT
Jiao Zhang created KAFKA-9616:

             Summary: Add new metrics to get total response time with throttle time subtracted
                 Key: KAFKA-9616
                 URL: https://issues.apache.org/jira/browse/KAFKA-9616
             Project: Kafka
          Issue Type: Improvement
          Components: core
    Affects Versions: 1.1.0
            Reporter: Jiao Zhang

We use these RequestMetrics for cluster monitoring:
[https://github.com/apache/kafka/blob/fb5bd9eb7cdfdae8ed1ea8f68e9be5687f610b28/core/src/main/scala/kafka/network/RequestChannel.scala#L364]

Our AlertManager is configured to fire an alert when the 99th percentile of 'TotalTimeMs'
exceeds a threshold. This alert is very important because it notifies cluster administrators
of bad situations, for example when a server drops out of the cluster or loses leadership.

However, we sometimes get false alerts. We set quotas such as 'producer_byte_rate' for some
clients, so when requests from these clients are throttled, 'ThrottleTimeMs' is long, and
sometimes the throttling alone pushes 'TotalTimeMs' over the threshold and triggers the alert.
As a result, we have to spend time investigating the details of these false alerts as well.

So this ticket proposes adding a new metric, 'ProcessTimeMs', whose value is the total
response time with the throttle time subtracted. This metric is more accurate for alerting
and would let us notice only the genuinely unexpected situations.
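
To illustrate the proposal, here is a minimal Python sketch of the per-request subtraction. It is not the broker implementation: the RequestTimings class, the process_time_ms helper, the nearest-rank percentile function, and the sample numbers are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestTimings:
    """Hypothetical per-request timings in milliseconds (not a Kafka class)."""
    total_time_ms: float
    throttle_time_ms: float


def process_time_ms(timings: RequestTimings) -> float:
    # Proposed 'ProcessTimeMs': total response time minus throttle time.
    # Clamping at zero is an assumption, to guard against rounding skew.
    return max(0.0, timings.total_time_ms - timings.throttle_time_ms)


def percentile(values, pct: int) -> float:
    # Nearest-rank percentile over a non-empty list of samples.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up sample: 97 normal requests and 3 requests throttled by a quota.
samples = [RequestTimings(20.0, 0.0)] * 97 + [RequestTimings(1020.0, 1000.0)] * 3

p99_total = percentile([s.total_time_ms for s in samples], 99)
p99_process = percentile([process_time_ms(s) for s in samples], 99)
print(p99_total, p99_process)
```

With this sample, the 99th percentile of the total time is dominated by the throttled requests, while the 99th percentile of the subtracted value stays at the normal level, so a threshold alert on the latter would not fire.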

Btw, we tried to achieve this with PromQL against the existing metrics, as Total - Throttle.
But it does not work, as it seems these two metrics are inconsistent in time. So it would be
better to expose a new metric from the broker side.
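
One possible contributing factor, which this ticket does not confirm, is that percentiles are not additive: the 99th percentile of Total minus the 99th percentile of Throttle is generally not the 99th percentile of the per-request difference. The Python sketch below uses made-up numbers and a hypothetical nearest-rank percentile helper to show the gap.

```python
import math


def percentile(values, pct: int) -> float:
    # Nearest-rank percentile over a non-empty list of samples.
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up (total_ms, throttle_ms) pairs:
# 96 normal requests, 2 throttled requests, 2 genuinely slow requests.
pairs = [(20.0, 0.0)] * 96 + [(1020.0, 1000.0)] * 2 + [(1000.0, 0.0)] * 2

totals = [total for total, _ in pairs]
throttles = [throttle for _, throttle in pairs]

# What a dashboard computes when it subtracts two exported percentiles.
diff_of_p99 = percentile(totals, 99) - percentile(throttles, 99)

# What a broker-side metric would report: percentile of per-request differences.
p99_of_diff = percentile([total - throttle for total, throttle in pairs], 99)

print(diff_of_p99, p99_of_diff)
```

In this sample the subtracted percentiles give 20 ms, while the per-request value has a 99th percentile of 1000 ms, so subtracting on the monitoring side would hide the genuinely slow requests. Only the broker sees both timings for the same request, which is consistent with the request to compute the metric there.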

This message was sent by Atlassian Jira