kafka-jira mailing list archives

From "Raoufeh Hashemian (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (KAFKA-5781) Frequent long produce latency periods that result in reduced produce rate.
Date Thu, 24 Aug 2017 17:49:00 GMT

    [ https://issues.apache.org/jira/browse/KAFKA-5781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140386#comment-16140386 ]

Raoufeh Hashemian commented on KAFKA-5781:
------------------------------------------

Can you please point me to the list of metrics that I can collect for a latency breakdown? I
only have the metrics reported through the Datadog integration, but I can add additional metrics
if they are available through JMX.
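
For reference, the broker's per-request timing metrics (documented in the monitoring section of the Kafka docs under kafka.network:type=RequestMetrics) split produce latency into queue, local, remote, and response components. Below is a minimal sketch of reading them over JMX. The class name, the choice of the 99thPercentile attribute, and the JMX service URL passed on the command line are assumptions made for illustration, and the MBean names should be checked against the broker version in use.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: prints the p99 of each produce-latency component on one broker.
class ProduceLatencyBreakdown {
    // Timing components reported per request type by the broker.
    static final List<String> COMPONENTS = Arrays.asList(
        "RequestQueueTimeMs",   // waiting in the request queue
        "LocalTimeMs",          // processing on the leader (includes the log append)
        "RemoteTimeMs",         // waiting for followers (relevant with acks=all)
        "ResponseQueueTimeMs",  // waiting in the response queue
        "ResponseSendTimeMs",   // sending the response
        "TotalTimeMs");         // end-to-end time on the broker

    // Builds the MBean name for one component of the Produce request.
    static ObjectName objectNameFor(String component) {
        try {
            return new ObjectName(
                "kafka.network:type=RequestMetrics,name=" + component + ",request=Produce");
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("Bad metric name: " + component, e);
        }
    }

    // Reads the 99thPercentile attribute of every component that is registered.
    static Map<String, Double> readBreakdown(MBeanServerConnection conn) throws Exception {
        Map<String, Double> result = new LinkedHashMap<>();
        for (String component : COMPONENTS) {
            ObjectName name = objectNameFor(component);
            if (conn.isRegistered(name)) {
                Number p99 = (Number) conn.getAttribute(name, "99thPercentile");
                result.put(component, p99.doubleValue());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            // Example URL shape: service:jmx:rmi:///jndi/rmi://<broker-host>:<jmx-port>/jmxrmi
            System.err.println("usage: ProduceLatencyBreakdown <jmx-service-url>");
            return;
        }
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(args[0]))) {
            readBreakdown(connector.getMBeanServerConnection())
                .forEach((k, v) -> System.out.println(k + " p99 = " + v + " ms"));
        }
    }
}
```

Comparing the components during a latency surge would indicate whether the time is spent queuing, in the local append, or waiting on replication. The same MBean names could be added to a JMX-based collector's configuration instead of being polled by hand.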

> Frequent long produce latency periods that result in reduced produce rate.
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-5781
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5781
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.11.0.0
>         Environment: CentOS Linux release 7.3.1611 , Kernel 3.10, java version "1.8.0_121"
>            Reporter: Raoufeh Hashemian
>         Attachments: frequent_latency_increase_diskactivity.png, frequent_latency_increase.png, frequent_latency_increase_zoomed.png
>
>
> When we upgraded from Kafka 0.10.2 to 0.11.0, I started to see frequent throughput drops
> with a predictable pattern (the attached file shows the pattern over a 14-hour period). This
> resulted in a degradation of up to 30% in our overall produce throughput.
> The drops correlate with a significant increase in 99th percentile latency (up to 4 seconds).
> We have a cluster of 6 brokers and a single topic. The problem happens both with and without
> consumers running, so I only included a case without consumers.
> There is no specific message in the broker logs when the latency surge happens. However,
> I found a correlation between the log rotation messages in the log and the longer cycles
> in the pattern (details shown in the attached graph: frequent_latency_increase.png).
> Each period of increased latency takes 5 to 20 minutes to finish (shown in the zoomed graph
> in the attached files).
> The broker CPU utilization goes down during this time and some disk read activity is
> observed (see the attached graph).
> This pattern started to appear in our environment exactly when we switched to Kafka 0.11.0.
> We kept idempotence set to false and didn't make any configuration changes when we switched.
> So I was wondering whether it could be a bug, or a configuration that needs to be changed
> after the upgrade?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
