flink-user mailing list archives

From Daniel Harper <Daniel.Har...@bbc.co.uk>
Subject Re: User ClassLoader leak on job restart
Date Wed, 16 Jan 2019 16:11:43 GMT
Hi Andrey, Till,

Redoing the job in plain Flink (without Beam) will take a while, and upgrading to a different
version of Flink is tricky (we use EMR).

However, I was just working on putting together a minimal job, when I noticed something that
might be interesting/might be a red herring.

I enabled the following JVM options on the job:

-XX:MaxMetaspaceSize=150M
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/dump/

This made the JVM write heap dumps to the above location when the OOM occurred (previously
I’d taken a heap dump before the OOM). I repeated the same process: downloading one of the
heap dumps, looking at the ‘lingering’ class loaders, then selecting a FlinkUserCodeClassLoader
-> Path to GC Roots -> With All References. This time I could see the following:


[cid:180C229F-4C6D-40D6-A6BD-BA685F4849DE]


This looks to me like https://github.com/FasterXML/jackson-databind/issues/1363 and seems
to stem from the Beam PipelineOptions class (at least that’s the way I’m interpreting it).
I’m going to try to reproduce this with a simple job and raise it on the Beam mailing list.
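For anyone following along, the general shape of this kind of leak can be sketched in plain Java. This is not Jackson's or Beam's actual code: the static `CACHE` below is a hypothetical stand-in for a long-lived type cache owned by a class outside the user class loader, and `simulateJobRun` is an invented helper that creates a throwaway class loader per "restart". It only illustrates the reference chain static cache -> Class -> ClassLoader that keeps each user class loader (and its Metaspace) reachable.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class LeakSketch {
    // Hypothetical stand-in for a static cache held by a library class that
    // lives in the parent (long-lived) class loader.
    static final Map<String, Class<?>> CACHE = new ConcurrentHashMap<>();

    // Simulates one job (re)start: a fresh "user code" class loader defines a
    // class, and that class ends up in the long-lived cache.
    static ClassLoader simulateJobRun(String key) {
        ClassLoader userLoader =
                new URLClassLoader(new URL[0], LeakSketch.class.getClassLoader());
        InvocationHandler noOp = (proxy, method, methodArgs) -> null;
        // The dynamic proxy class is defined by userLoader, so it plays the
        // role of a class shipped in the job jar.
        Object jobObject =
                Proxy.newProxyInstance(userLoader, new Class<?>[] {Runnable.class}, noOp);
        CACHE.put(key, jobObject.getClass());
        return userLoader;
    }

    // Number of distinct class loaders still strongly reachable via the cache.
    static long pinnedLoaders() {
        return CACHE.values().stream().map(Class::getClassLoader).distinct().count();
    }

    public static void main(String[] args) {
        for (int restart = 1; restart <= 6; restart++) {
            simulateJobRun("restart-" + restart);
        }
        System.out.println("loaders pinned by cache: " + pinnedLoaders());
        CACHE.clear(); // dropping the cached classes removes the strong path
        System.out.println("loaders pinned after clear: " + pinnedLoaders());
    }
}
```

Each cached `Class` holds a strong reference to its defining loader, so none of the per-restart loaders can be collected until the cache entry goes away.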

Will tell you how I get on






From: Till Rohrmann <trohrmann@apache.org>
Date: Wednesday, 16 January 2019 at 09:56
To: Andrey Zagrebin <andrey@da-platform.com>
Cc: Daniel Harper <Daniel.Harper@bbc.co.uk>, "user@flink.apache.org" <user@flink.apache.org>
Subject: Re: User ClassLoader leak on job restart

Hi Daniel,

Would it be possible to run directly on Flink in order to take Beam out of the equation? Moreover,
I would be interested to know whether the same problem still occurs with the latest Flink version,
or at least with Flink 1.5.6. It is hard to tell whether Flink or the Beam Flink runner causes the
class loader leak.

Cheers,
Till

On Tue, Jan 15, 2019 at 7:17 PM Andrey Zagrebin <andrey@da-platform.com> wrote:
Hi Daniel,

Could you share the code of a minimal example job that fails this way, so that we can analyse
a thread dump of it?

Best,
Andrey

On Tue, Jan 15, 2019 at 3:59 PM Daniel Harper <Daniel.Harper@bbc.co.uk> wrote:

Environment/Context:

Flink 1.5.2
Beam 2.7.0
AWS EMR 5.17.0
Orchestrator: YARN
Nature of job:

  *   Source: Amazon Kinesis
  *   Sink:
     *   Amazon S3

We execute our job by creating a fresh YARN session each time using `flink run`

We have noticed that when our job restarts due to an exception, the number of loaded classes
increases, which in turn pushes up Metaspace memory usage. Eventually, after a number of restarts,
YARN kills the container for exceeding its physical memory limit.
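One way to watch this growth from inside a JVM is via the standard `java.lang.management` beans. The sketch below is only illustrative: the class name `MetaspaceProbe` is made up, and it assumes a HotSpot JVM where the relevant memory pool is named "Metaspace" (other JVMs may name it differently, in which case the probe reports -1).

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class MetaspaceProbe {
    // Bytes currently used by the pool named "Metaspace" (HotSpot naming,
    // an assumption), or -1 if no such pool is exposed by this JVM.
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    // One-line snapshot suitable for logging after each job restart.
    static String snapshot() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format(
                "loaded=%d unloaded=%d metaspaceUsedBytes=%d",
                classes.getLoadedClassCount(),
                classes.getUnloadedClassCount(),
                metaspaceUsedBytes());
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```

If the loaded count climbs after every restart while the unloaded count stays flat, the old user classes are not being released.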

My colleagues and I have documented this in the following issues (we recognise this might
seem disorganised!)

https://issues.apache.org/jira/browse/FLINK-10928

https://issues.apache.org/jira/browse/FLINK-11205

https://issues.apache.org/jira/browse/FLINK-10317

We’ve tried setting -XX:MaxMetaspaceSize=180M, which prevents YARN from killing the container;
however, the TaskManager then throws java.lang.OutOfMemoryError: Metaspace.

We’ve also tried putting the job jar in the flink/lib directory, but this seems to cause more
problems: log4j seems to stop working, and we also saw network connectivity issues to Amazon S3
which we cannot explain but can reproduce when using this approach.

I took a heap dump of one of the task managers after it had restarted 6 times and followed
this guide http://java.jiderhamn.se/2011/12/11/classloader-leaks-i-how-to-find-classloader-leaks-with-eclipse-memory-analyser-mat/
using Eclipse MAT

As you can see, there are 6 FlinkUserCodeClassLoader instances present in the heap dump:

[cid:16852b7d7c0f09e33c91]

When selecting one of these and clicking Path to GC Roots -> With All References, all it
seems to show is


[cid:16852b7d7c0c05276062]


This is as far as my understanding goes, so I’m not sure what to look at next. One point
discussed in that blog post is ThreadLocals keeping state around, but I’m not sure whether
that applies here.

Does anyone have any guidance on where to look next? Any help would be very much appreciated!




----------------------------

http://www.bbc.co.uk
This e-mail (and any attachments) is confidential and may contain personal views which are
not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on it and notify
the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.

---------------------


