kafka-jira mailing list archives

From "ilya morgenshtern (Jira)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-10901) Lock contention on high produce rate causing cluster degradation
Date Wed, 06 Jan 2021 18:19:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-10901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ilya morgenshtern updated KAFKA-10901:
--------------------------------------
    Description: 
Scaling up the producers (20 -> 40) caused the broker idle percentage to drop from 70-80% to 0-1%,
the request queue size to increase by 200%, and overall producer latency to increase by 700%.
 Also, CPU usage dropped by 30%.

After profiling we saw high lock contention on the write (produce) requests, but CPU remained low
and we didn't see any unusual disk read/write/IOPS activity. If anything, the opposite: because
everything became slower, the cluster processed much less data.

!Screen Shot 2021-01-04 at 11.46.47.png|width=576,height=23!

For comparison, with 20 producers the produce/fetch ratio looked like this:

!Screen Shot 2021-01-04 at 11.46.08.png|width=567,height=31!

 

From limited observation, the number of produce requests from this upscaled producer group
increased from 1500 to 2500 per second (150 per broker), but the overall produce request rate in
the cluster remained the same; on the other hand, the number of fetch requests decreased by 50%.

To fix the issue we increased this specific producer's linger.ms from 500 ms to 1000 ms, and the
whole cluster immediately became healthy.
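
For reference, here is a minimal sketch of the producer-side change, assuming the Sarama Go client
named in the environment below (Sarama exposes the equivalent of the Java client's linger.ms as
Producer.Flush.Frequency); the broker address is an illustrative placeholder:

{code:go}
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	cfg := sarama.NewConfig()

	// Producer.Flush.Frequency is Sarama's counterpart to linger.ms: how long
	// the producer waits to accumulate a batch before sending it. Raising it
	// from 500ms to 1000ms roughly halves the produce request rate per
	// producer by sending larger, less frequent batches.
	cfg.Producer.Flush.Frequency = 1000 * time.Millisecond

	// 2 MB batch size, matching the environment described in this issue.
	cfg.Producer.Flush.Bytes = 2 * 1024 * 1024

	// Placeholder broker address for illustration.
	producer, err := sarama.NewAsyncProducer([]string{"broker-1:9092"}, cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// Drain the error channel so the async producer never blocks on it.
	go func() {
		for perr := range producer.Errors() {
			log.Println("produce error:", perr)
		}
	}()
}
{code}

The trade-off is up to 500 ms of additional batching latency per producer in exchange for fewer,
larger produce requests and correspondingly less broker-side lock contention.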

 


> Lock contention on high produce rate causing cluster degradation
> ----------------------------------------------------------------
>
>                 Key: KAFKA-10901
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10901
>             Project: Kafka
>          Issue Type: Bug
>          Components: producer 
>    Affects Versions: 2.5.0
>         Environment: broker: version 2.5.0 with 8 cores, 32 GB RAM, HDD
> producer: Sarama (Go) producer version 1.5.2 with 500 ms linger and 2 MB batch size
>            Reporter: ilya morgenshtern
>            Priority: Major
>         Attachments: Screen Shot 2021-01-04 at 11.46.08.png, Screen Shot 2021-01-04 at 11.46.47.png
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
