flink-user mailing list archives

From Hao Sun <ha...@zendesk.com>
Subject Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?
Date Thu, 16 Nov 2017 16:48:38 GMT
Sorry, by "killed" I mean that the JM lost the TM. The TM instance is still
running inside Kubernetes, but it is not responding to any requests,
probably due to high load. On the JM side, the heartbeat from the TM was
lost, so the JM marked the TM as dead.

By the "volume" of Kafka topics I mean the volume of messages per topic,
e.g. 10000 msg/sec. I have not checked the size of the messages yet.
But overall, as you suggested, I think I need more tuning for my TM params,
so it can maintain a reasonable load. I am not sure what params to look
for, but I will do my research first.

Thanks as always for your help, Stefan.

On Thu, Nov 16, 2017 at 8:27 AM Stefan Richter <s.richter@data-artisans.com> wrote:

> Hi,
>
> > In addition to your comments, what are the items retained by
> > NetworkEnvironment? They seem to grow indefinitely; do they ever shrink?
>
> Mostly the network buffers, which should be ok. They are always recycled
> and should not be released until the network environment is GCed.
>
> > I think there is a GC issue because my task manager is killed somehow
> > after a job runs. The time until it dies correlates with the volume of
> > Kafka topics: with more volume, the TM dies more quickly. Do you have
> > any tips to debug it?
>
> What killed your task manager? For example, do you see a
> java.lang.OutOfMemoryError, or is the process killed by the OS's OOM killer?
> In case of an OOM killer, you might need to grant more process memory or
> reduce the memory that you have configured for Flink to stay below the
> configured threshold that would kill the process. What exactly do you mean
> by "volume" of Kafka topics?
>
> To debug, I suggest that you first figure out why the process is killed;
> maybe your thresholds are simply too low and the consumption can go beyond
> them with your configuration of Flink. Then you should figure out what is
> actually growing more than you expect, e.g. is the problem triggered by
> heap space or native memory? Depending on the answer, e.g. heap dumps could
> help to spot the problematic objects.
> Best,
> Stefan
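
For the heap-versus-native question above, here is a minimal sketch of how
one might sample the JVM's own memory counters. It is only an illustration:
MemoryProbe is a hypothetical helper, not part of Flink, and it uses only the
standard java.lang.management beans. It reports on the JVM it runs in, so it
would have to be called from code running inside the TM process. It also does
not see memory allocated by native libraries outside the JVM's buffer pools,
so a stable reading here does not rule out native growth.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper (not part of Flink): samples memory counters of the current JVM. */
public final class MemoryProbe {

    private static final long MIB = 1024L * 1024L;

    private MemoryProbe() {}

    /** Heap bytes currently in use, as reported by the JVM. */
    public static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** JVM-managed non-heap bytes in use (e.g. metaspace, code cache). */
    public static long nonHeapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage().getUsed();
    }

    /** Bytes held by the JVM's NIO buffer pools (direct and mapped buffers). */
    public static long bufferPoolUsedBytes() {
        long total = 0;
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            long used = pool.getMemoryUsed();
            if (used > 0) { // getMemoryUsed() returns -1 when no estimate is available
                total += used;
            }
        }
        return total;
    }

    /** One-line summary that is easy to grep for in a log. */
    public static String report() {
        return String.format(
                "heap=%d MiB, nonHeap=%d MiB, nioBuffers=%d MiB",
                heapUsedBytes() / MIB, nonHeapUsedBytes() / MIB, bufferPoolUsedBytes() / MIB);
    }

    /** Prints a few samples one second apart; the sample count is an optional argument. */
    public static void main(String[] args) throws InterruptedException {
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        for (int i = 0; i < samples; i++) {
            System.out.println(report());
            Thread.sleep(1000);
        }
    }
}
```

If the heap number keeps climbing until the TM stops responding, a heap dump
is probably the next step. If heap stays flat while the container's memory
use grows, the buffer pools or other native allocations look more likely.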
