kafka-jira mailing list archives

From "Robin Tweedie (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (KAFKA-6199) Single broker with fast growing heap usage
Date Fri, 10 Nov 2017 14:07:00 GMT

     [ https://issues.apache.org/jira/browse/KAFKA-6199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Tweedie updated KAFKA-6199:
---------------------------------
    Description: 
We have a single broker in our cluster of 25 with fast-growing heap usage that forces us to restart it every 12 hours. If we don't restart the broker, it becomes very slow due to long GC pauses and eventually fails with {{OutOfMemoryError}}.
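
For anyone trying to capture the failure, a minimal set of HotSpot flags (Java 8, passed via {{KAFKA_OPTS}}; the dump path is just an example) that takes a heap dump at the moment of failure and logs GC pauses:
{noformat}
# dump path is an example -- point it at a disk with enough free space
export KAFKA_OPTS="-XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/tmp/kafka-broker.hprof \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
{noformat}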

See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage percentage on
the broker. A "normal" broker in the same cluster stays below 50% (averaged) over the same
time period.

We have taken heap dumps when the broker's heap usage gets dangerously high; they show a large number of retained {{NetworkSend}} objects referencing byte buffers.
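
A quick way to quantify this on the live broker without sharing a full dump is a class histogram ({{<broker-pid>}} is a placeholder; the grep just matches the suspect class names):
{noformat}
# histogram of live objects, filtered to the suspects
# note: the :live option forces a full GC on the target JVM
jmap -histo:live <broker-pid> | egrep 'NetworkSend|ByteBuffer'
{noformat}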

We also noticed that the affected broker logs this kind of warning far more often than any other broker:
{noformat}
WARN Attempting to send response via channel for which there is no open connection, connection id 13 (kafka.network.Processor)
{noformat}
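
A rough way to break those warnings down by connection id (the log path is an assumption; use your broker's {{server.log}} location):
{noformat}
grep 'no open connection' logs/server.log \
  | grep -o 'connection id [^ ]*' | sort | uniq -c | sort -rn
{noformat}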

See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log message visualized across all the brokers: it happens occasionally on the other brokers, but not nearly as often as on the "bad" broker.

I can't make the heap dumps public, but would appreciate advice on how to pin down the problem
better. We're currently trying to narrow it down to a particular client, but without much
success so far.
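
One avenue we may try next (an untested suggestion, based on the log4j defaults Kafka ships with): turning up the broker's request logger in {{config/log4j.properties}}, which should log a summary of every request including the client id and connection, so the leaked sends can hopefully be tied back to a client:
{noformat}
# DEBUG should log a one-line summary per request; TRACE includes full
# request detail -- both are verbose, so enable only briefly
log4j.logger.kafka.request.logger=DEBUG, requestAppender
log4j.additivity.kafka.request.logger=false
{noformat}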

Let me know what else I could investigate or share to track down the source of this leak.

  was:
We have a single broker in our cluster of 25 with fast-growing heap usage that forces us to restart it every 12 hours. If we don't restart the broker, it becomes very slow due to long GC pauses and eventually fails with {{OutOfMemoryError}}.

Here's a graph of heap usage percentage. A "normal" broker in the same cluster stays below
50% (averaged) over the same time period.

!Screen Shot 2017-11-10 at 11.59.06 AM.png|thumbnail!

We have taken heap dumps when the broker's heap usage gets dangerously high; they show a large number of retained {{NetworkSend}} objects referencing byte buffers.

We also noticed that the affected broker logs this kind of warning far more often than any other broker:
{noformat}
WARN Attempting to send response via channel for which there is no open connection, connection id 13 (kafka.network.Processor)
{noformat}

Here are counts of that WARN message visualized across all the brokers (it happens occasionally on other brokers, but not nearly as often as on the "bad" broker):
!Screen Shot 2017-11-10 at 1.55.33 PM.png|thumbnail!

I can't make the heap dumps public, but would appreciate advice on how to pin down the problem
better. We're currently trying to narrow it down to a particular client, but without much
success so far.

Let me know what else I could investigate or share to track down the source of this leak.


> Single broker with fast growing heap usage
> ------------------------------------------
>
>                 Key: KAFKA-6199
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6199
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>         Environment: Amazon Linux
>            Reporter: Robin Tweedie
>         Attachments: Screen Shot 2017-11-10 at 1.55.33 PM.png, Screen Shot 2017-11-10 at 11.59.06 AM.png
>
>
> We have a single broker in our cluster of 25 with fast-growing heap usage that forces us to restart it every 12 hours. If we don't restart the broker, it becomes very slow due to long GC pauses and eventually fails with {{OutOfMemoryError}}.
> See {{Screen Shot 2017-11-10 at 11.59.06 AM.png}} for a graph of heap usage percentage
on the broker. A "normal" broker in the same cluster stays below 50% (averaged) over the same
time period.
> We have taken heap dumps when the broker's heap usage gets dangerously high; they show a large number of retained {{NetworkSend}} objects referencing byte buffers.
> We also noticed that the affected broker logs this kind of warning far more often than any other broker:
> {noformat}
> WARN Attempting to send response via channel for which there is no open connection, connection id 13 (kafka.network.Processor)
> {noformat}
> See {{Screen Shot 2017-11-10 at 1.55.33 PM.png}} for counts of that WARN log message visualized across all the brokers: it happens occasionally on the other brokers, but not nearly as often as on the "bad" broker.
> I can't make the heap dumps public, but would appreciate advice on how to pin down the
problem better. We're currently trying to narrow it down to a particular client, but without
much success so far.
> Let me know what else I could investigate or share to track down the source of this leak.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
