Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
From: Piotr Nowojski <piotr@data-artisans.com>
Message-Id: <9E44A461-061E-4474-AE1E-9E9D70DA23BD@data-artisans.com>
Content-Type: multipart/alternative;
	boundary="Apple-Mail=_8241DE43-4FF9-4491-BB43-BD8AE94DB6DD"
Mime-Version: 1.0 (Mac OS X Mail 11.1 \(3445.4.7\))
Subject: Re: Flink memory leak
Date: Tue, 14 Nov 2017 16:02:24 +0100
In-Reply-To: <3edbdfc7da76ff2c2deecc4280f2168a@cs.hacettepe.edu.tr>
Cc: Aljoscha Krettek <aljoscha@apache.org>,
 Nico Kruber <nico@data-artisans.com>,
 Ufuk Celebi <uce@apache.org>,
 user <user@flink.apache.org>
To: =?utf-8?B?w4dFVMSwTktBWUEgRUJSVSDDh0VUxLBOS0FZQSBFQlJV?= <b20926247@cs.hacettepe.edu.tr>,
 Javier Lopez <javier.lopez@zalando.de>,
 pompermaier@okkam.it
References: <58D3AB1E-821D-46A2-8F04-8BB708037C2B@cs.hacettepe.edu.tr>
 <EF1606F3-6C62-4B4D-8942-CA9326CBDE7D@cs.hacettepe.edu.tr>
 <CAKiyyaE9Q+NYiCTdzOztFA4gmWS0tF64kvpqFrDuhKD-d1_AQA@mail.gmail.com>
 <30C290C1-3FFD-479D-A926-861B110638BD@cs.hacettepe.edu.tr>
 <CAKiyyaFoQH5FG_dQ85Uy=kHvsnMFSipUeG956uKbpqS00e8Obw@mail.gmail.com>
 <8c5facf8eda9ed6197c9e1c36099f8bb@cs.hacettepe.edu.tr>
 <CAELUF_AK_Ju9uc7tWt6WPHHJAm6eqZOmhftPZmsfO_SP6kS_nw@mail.gmail.com>
 <BEF55744-31EC-4842-ACC0-6BA7165220A5@apache.org>
 <CANq+ctq3FoQLe6xnVbSchZ_3NM130+qp1RH11pmM7C81LYoAvw@mail.gmail.com>
 <D13E5B00-B50A-4C51-A4C9-FDA38D362F3B@cs.hacettepe.edu.tr>
 <B337B5FB-E19C-4BD7-B569-3D1776F8AB16@data-artisans.com>
 <e3a3d62555d2f4d26170d4e3af1fce16@cs.hacettepe.edu.tr>
 <CANq+ctopgE_Z=fvU+BwK0LFMqcn-d4CCB7PGi8KirB=PjtQ5gw@mail.gmail.com>
 <14DD60D2-6E14-4DF3-836C-0A712F68B8C2@data-artisans.com>
 <CANq+ctq2nPGC+mVgdwzcHfD5THxHvm6BPWvA4Td=xa-55JWnfA@mail.gmail.com>
 <95582EE2-9F90-43B8-8B3A-71A027A5DB1B@data-artisans.com>
 <CANq+ctrnGMb6vGocNkNi6vUpQAseEivpT1DfGX3qn3cM8GDoEQ@mail.gmail.com>
 <B3A92502-A391-4F78-82E7-78323B5A645A@data-artisans.com>
 <972D0323-A628-45C6-AD1B-654EBD95AD96@data-artisans.com>
 <0019c4f2caf07e0fbde7a73ab7361002@cs.hacettepe.edu.tr>
 <9021B6B4-8A8D-4DBF-9C24-AC1F06956A4C@data-artisans.com>
 <c8a545df4fc632e8edcd4d0b1c5267ea@cs.hacettepe.edu.tr>
 <C594067D-31C6-44DF-B5CF-7017127D0DD6@data-artisans.com>
 <57a785fab1f98ee12060d42adad186ff@cs.hacettepe.edu.tr>
 <633C1FF0-9221-4689-9401-B3708DA305F4@data-artisans.com>
 <3edbdfc7da76ff2c2deecc4280f2168a@cs.hacettepe.edu.tr>
archived-at: Tue, 14 Nov 2017 15:02:37 -0000


--Apple-Mail=_8241DE43-4FF9-4491-BB43-BD8AE94DB6DD
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

Ebru, Javier, Flavio:

I tried to reproduce memory leak by submitting a job, that was =
generating classes with random names. And indeed I have found one. =
Memory was accumulating in `char[]` instances that belonged to =
`java.lang.ClassLoader#parallelLockMap`. OldGen memory pool was growing =
in size up to the point I got:

java.lang.OutOfMemoryError: Java heap space

This seems like an old known =E2=80=9Cfeature=E2=80=9D of JDK:
https://bugs.openjdk.java.net/browse/JDK-8037342 =
<https://bugs.openjdk.java.net/browse/JDK-8037342>

Can any of you confirm that this is the issue that you are experiencing? =
If not, I would really need more help/information from you to track this =
down.

Piotrek

> On 10 Nov 2017, at 15:12, =C3=87ET=C4=B0NKAYA EBRU =C3=87ET=C4=B0NKAYA =
EBRU <b20926247@cs.hacettepe.edu.tr> wrote:
>=20
> On 2017-11-10 13:14, Piotr Nowojski wrote:
>> jobmanager1.log and taskmanager2.log are the same. Can you also =
submit
>> files containing std output?
>> Piotrek
>>> On 10 Nov 2017, at 09:35, =C3=87ET=C4=B0NKAYA EBRU =C3=87ET=C4=B0NKAYA=
 EBRU <b20926247@cs.hacettepe.edu.tr> wrote:
>>> On 2017-11-10 11:04, Piotr Nowojski wrote:
>>>> Hi,
>>>> Thanks for the logs, however I do not see before mentioned =
exceptions
>>>> in it. It ends with java.lang.InterruptedException
>>>> Is it the correct log file? Also, could you attach the std output =
file
>>>> of the failing TaskManager?
>>>> Piotrek
>>>>> On 10 Nov 2017, at 08:42, =C3=87ET=C4=B0NKAYA EBRU =C3=87ET=C4=B0NKA=
YA EBRU <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>> On 2017-11-09 20:08, Piotr Nowojski wrote:
>>>>>> Hi,
>>>>>> Could you attach full logs from those task managers? At first =
glance I
>>>>>> don=E2=80=99t see a connection between those exceptions and any =
memory issue
>>>>>> that you might had. It looks like a dependency issue in one =
(some?
>>>>>> All?) of your jobs.
>>>>>> Did you build your jars with -Pbuild-jar profile as described =
here:
>>>>>> =
https://ci.apache.org/projects/flink/flink-docs-release-1.3/quickstart/jav=
a_api_quickstart.html#build-project
>>>>>> ?
>>>>>> If that doesn=E2=80=99t help. Can you binary search which job is =
causing the
>>>>>> problem? There might be some Flink incompatibility between =
different
>>>>>> versions and rebuilding a job=E2=80=99s jar with a version =
matching to the
>>>>>> cluster version might help.
>>>>>> Piotrek
>>>>>>> On 9 Nov 2017, at 17:36, =C3=87ET=C4=B0NKAYA EBRU =C3=87ET=C4=B0NK=
AYA EBRU
>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>> On 2017-11-08 18:30, Piotr Nowojski wrote:
>>>>>>> Btw, Ebru:
>>>>>>> I don=E2=80=99t agree that the main suspect is =
NetworkBufferPool. On your
>>>>>>> screenshots it=E2=80=99s memory consumption was reasonable and =
stable:
>>>>>>> 596MB
>>>>>>> -> 602MB -> 597MB.
>>>>>>> PoolThreadCache memory usage ~120MB is also reasonable.
>>>>>>> Do you experience any problems, like Out Of Memory
>>>>>>> errors/crashes/long
>>>>>>> GC pauses? Or just JVM process is using more memory over time? =
You
>>>>>>> are
>>>>>>> aware that JVM doesn=E2=80=99t like to release memory back to OS =
once it
>>>>>>> was
>>>>>>> used? So increasing memory usage until hitting some limit (for
>>>>>>> example
>>>>>>> JVM max heap size) is expected behaviour.
>>>>>>> Piotrek
>>>>>>> On 8 Nov 2017, at 15:48, Piotr Nowojski =
<piotr@data-artisans.com>
>>>>>>> wrote:
>>>>>>> I don=E2=80=99t know if this is relevant to this issue, but I =
was
>>>>>>> constantly getting failures trying to reproduce this leak using =
your
>>>>>>> Job, because you were using non deterministic getKey function:
>>>>>>> @Override
>>>>>>> public Integer getKey(Integer event) {
>>>>>>> Random randomGen =3D new Random((new Date()).getTime());
>>>>>>> return randomGen.nextInt() % 8;
>>>>>>> }
>>>>>>> And quoting Java doc of KeySelector:
>>>>>>> "If invoked multiple times on the same object, the returned key =
must
>>>>>>> be the same.=E2=80=9D
>>>>>>> I=E2=80=99m trying to reproduce this issue with following job:
>>>>>>> =
https://gist.github.com/pnowojski/b80f725c1af7668051c773438637e0d3
>>>>>>> Where IntegerSource is just an infinite source, DisardingSink is
>>>>>>> well just discarding incoming data. I=E2=80=99m cancelling the =
job every 5
>>>>>>> seconds and so far (after ~15 minutes) my memory consumption is
>>>>>>> stable, well below maximum java heap size.
>>>>>>> Piotrek
>>>>>>> On 8 Nov 2017, at 15:28, Javier Lopez <javier.lopez@zalando.de>
>>>>>>> wrote:
>>>>>>> Yes, I tested with just printing the stream. But it could take a
>>>>>>> lot of time to fail.
>>>>>>> On Wednesday, 8 November 2017, Piotr Nowojski
>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>> Thanks for quick answer.
>>>>>>> So it will also fail after some time with `fromElements` source
>>>>>>> instead of Kafka, right?
>>>>>>> Did you try it also without a Kafka producer?
>>>>>>> Piotrek
>>>>>>> On 8 Nov 2017, at 14:57, Javier Lopez <javier.lopez@zalando.de>
>>>>>>> wrote:
>>>>>>> Hi,
>>>>>>> You don't need data. With data it will die faster. I tested as
>>>>>>> well with a small data set, using the fromElements source, but =
it
>>>>>>> will take some time to die. It's better with some data.
>>>>>>> On 8 November 2017 at 14:54, Piotr Nowojski
>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>> Hi,
>>>>>>> Thanks for sharing this job.
>>>>>>> Do I need to feed some data to the Kafka to reproduce this
>>>>>> issue with your script?
>>>>>>>> Does this OOM issue also happen when you are not using the
>>>>>> Kafka source/sink?
>>>>>>>> Piotrek
>>>>>>>> On 8 Nov 2017, at 14:08, Javier Lopez <javier.lopez@zalando.de>
>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> This is the test flink job we created to trigger this leak
>>>>>> =
https://gist.github.com/javieredo/c6052404dbe6cc602e99f4669a09f7d6
>>>>>>>> And this is the python script we are using to execute the job
>>>>>> thousands of times to get the OOM problem
>>>>>> =
https://gist.github.com/javieredo/4825324d5d5f504e27ca6c004396a107
>>>>>>>> The cluster we used for this has this configuration:
>>>>>>>> Instance type: t2.large
>>>>>>>> Number of workers: 2
>>>>>>>> HeapMemory: 5500
>>>>>>>> Number of task slots per node: 4
>>>>>>>> TaskMangMemFraction: 0.5
>>>>>>>> NumberOfNetworkBuffers: 2000
>>>>>>>> We have tried several things, increasing the heap, reducing the
>>>>>> heap, more memory fraction, changes this value in the
>>>>>> taskmanager.sh "TM_MAX_OFFHEAP_SIZE=3D"2G"; and nothing seems to
>>>>>> work.
>>>>>>>> Thanks for your help.
>>>>>>>> On 8 November 2017 at 13:26, =C3=87ET=C4=B0NKAYA EBRU =
=C3=87ET=C4=B0NKAYA EBRU
>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>> On 2017-11-08 15:20, Piotr Nowojski wrote:
>>>>>>> Hi Ebru and Javier,
>>>>>>> Yes, if you could share this example job it would be helpful.
>>>>>>> Ebru: could you explain in a little more details how does
>>>>>> your Job(s)
>>>>>>> look like? Could you post some code? If you are just using
>>>>>> maps and
>>>>>>> filters there shouldn=E2=80=99t be any network transfers =
involved,
>>>>>> aside
>>>>>>> from Source and Sink functions.
>>>>>>> Piotrek
>>>>>>> On 8 Nov 2017, at 12:54, ebru
>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>> Hi Javier,
>>>>>>> It would be helpful if you share your test job with us.
>>>>>>> Which configurations did you try?
>>>>>>> -Ebru
>>>>>>> On 8 Nov 2017, at 14:43, Javier Lopez
>>>>>> <javier.lopez@zalando.de>
>>>>>>> wrote:
>>>>>>> Hi,
>>>>>>> We have been facing a similar problem. We have tried some
>>>>>> different
>>>>>>> configurations, as proposed in other email thread by Flavio
>>>>>> and
>>>>>>> Kien, but it didn't work. We have a workaround similar to
>>>>>> the one
>>>>>>> that Flavio has, we restart the taskmanagers once they reach
>>>>>> a
>>>>>>> memory threshold. We created a small test to remove all of
>>>>>> our
>>>>>>> dependencies and leave only flink native libraries. This
>>>>>> test reads
>>>>>>> data from a Kafka topic and writes it back to another topic
>>>>>> in
>>>>>>> Kafka. We cancel the job and start another every 5 seconds.
>>>>>> After
>>>>>>> ~30 minutes of doing this process, the cluster reaches the
>>>>>> OS memory
>>>>>>> limit and dies.
>>>>>>> Currently, we have a test cluster with 8 workers and 8 task
>>>>>> slots
>>>>>>> per node. We have one job that uses 56 slots, and we cannot
>>>>>> execute
>>>>>>> that job 5 times in a row because the whole cluster dies. If
>>>>>> you
>>>>>>> want, we can publish our test job.
>>>>>>> Regards,
>>>>>>> On 8 November 2017 at 11:20, Aljoscha Krettek
>>>>>> <aljoscha@apache.org>
>>>>>>> wrote:
>>>>>>> @Nico & @Piotr Could you please have a look at this? You
>>>>>> both
>>>>>>> recently worked on the network stack and might be most
>>>>>> familiar with
>>>>>>> this.
>>>>>>> On 8. Nov 2017, at 10:25, Flavio Pompermaier
>>>>>> <pompermaier@okkam.it>
>>>>>>> wrote:
>>>>>>> We also have the same problem in production. At the moment
>>>>>> the
>>>>>>> solution is to restart the entire Flink cluster after every
>>>>>> job..
>>>>>>> We've tried to reproduce this problem with a test (see
>>>>>>> https://issues.apache.org/jira/browse/FLINK-7845 [1]) but we
>>>>>> don't
>>>>>>> know whether the error produced by the test and the leak are
>>>>>>> correlated..
>>>>>>> Best,
>>>>>>> Flavio
>>>>>>> On Wed, Nov 8, 2017 at 9:51 AM, =C3=87ET=C4=B0NKAYA EBRU =
=C3=87ET=C4=B0NKAYA
>>>>>> EBRU
>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>> On 2017-11-07 16:53, Ufuk Celebi wrote:
>>>>>>> Do you use any windowing? If yes, could you please share
>>>>>> that code?
>>>>>>> If
>>>>>>> there is no stateful operation at all, it's strange where
>>>>>> the list
>>>>>>> state instances are coming from.
>>>>>>> On Tue, Nov 7, 2017 at 2:35 PM, ebru
>>>>>> <b20926247@cs.hacettepe.edu.tr>
>>>>>>> wrote:
>>>>>>> Hi Ufuk,
>>>>>>> We don=E2=80=99t explicitly define any state descriptor. We only
>>>>>> use map
>>>>>>> and filters
>>>>>>> operator. We thought that gc handle clearing the flink=E2=80=99s
>>>>>> internal
>>>>>>> states.
>>>>>>> So how can we manage the memory if it is always increasing?
>>>>>>> - Ebru
>>>>>>> On 7 Nov 2017, at 16:23, Ufuk Celebi <uce@apache.org> wrote:
>>>>>>> Hey Ebru, the memory usage might be increasing as long as a
>>>>>> job is
>>>>>>> running.
>>>>>>> This is expected (also in the case of multiple running
>>>>>> jobs). The
>>>>>>> screenshots are not helpful in that regard. :-(
>>>>>>> What kind of stateful operations are you using? Depending on
>>>>>> your
>>>>>>> use case,
>>>>>>> you have to manually call `clear()` on the state instance in
>>>>>> order
>>>>>>> to
>>>>>>> release the managed state.
>>>>>>> Best,
>>>>>>> Ufuk
>>>>>>> On Tue, Nov 7, 2017 at 12:43 PM, ebru
>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>> Begin forwarded message:
>>>>>>> From: ebru <b20926247@cs.hacettepe.edu.tr>
>>>>>>> Subject: Re: Flink memory leak
>>>>>>> Date: 7 November 2017 at 14:09:17 GMT+3
>>>>>>> To: Ufuk Celebi <uce@apache.org>
>>>>>>> Hi Ufuk,
>>>>>>> There are there snapshots of htop output.
>>>>>>> 1. snapshot is initial state.
>>>>>>> 2. snapshot is after submitted one job.
>>>>>>> 3. Snapshot is the output of the one job with 15000 EPS. And
>>>>>> the
>>>>>>> memory
>>>>>>> usage is always increasing over time.
>>>>>>> <1.png><2.png><3.png>
>>>>>>> On 7 Nov 2017, at 13:34, Ufuk Celebi <uce@apache.org> wrote:
>>>>>>> Hey Ebru,
>>>>>>> let me pull in Aljoscha (CC'd) who might have an idea what's
>>>>>> causing
>>>>>>> this.
>>>>>>> Since multiple jobs are running, it will be hard to
>>>>>> understand to
>>>>>>> which job the state descriptors from the heap snapshot
>>>>>> belong to.
>>>>>>> - Is it possible to isolate the problem and reproduce the
>>>>>> behaviour
>>>>>>> with only a single job?
>>>>>>> =E2=80=93 Ufuk
>>>>>>> On Tue, Nov 7, 2017 at 10:27 AM, =C3=87ET=C4=B0NKAYA EBRU
>>>>>> =C3=87ET=C4=B0NKAYA EBRU
>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>> Hi,
>>>>>>> We are using Flink 1.3.1 in production, we have one job
>>>>>> manager and
>>>>>>> 3 task
>>>>>>> managers in standalone mode. Recently, we've noticed that we
>>>>>> have
>>>>>>> memory
>>>>>>> related problems. We use docker container to serve Flink
>>>>>> cluster. We
>>>>>>> have
>>>>>>> 300 slots and 20 jobs are running with parallelism of 10.
>>>>>> Also the
>>>>>>> job
>>>>>>> count
>>>>>>> may be change over time. Taskmanager memory usage always
>>>>>> increases.
>>>>>>> After
>>>>>>> job cancelation this memory usage doesn't decrease. We've
>>>>>> tried to
>>>>>>> investigate the problem and we've got the task manager jvm
>>>>>> heap
>>>>>>> snapshot.
>>>>>>> According to the jam heap analysis, possible memory leak was
>>>>>> Flink
>>>>>>> list
>>>>>>> state descriptor. But we are not sure that is the cause of
>>>>>> our
>>>>>>> memory
>>>>>>> problem. How can we solve the problem?
>>>>>>> We have two types of Flink job. One has no state full
>>>>>> operator
>>>>>>> contains only maps and filters and the other has time window
>>>>>> with
>>>>>>> count trigger.
>>>>>>> * We've analysed the jvm heaps again in different
>>>>>> conditions. First
>>>>>>> we analysed the snapshot when no flink jobs running on
>>>>>> cluster. (image
>>>>>>> 1)
>>>>>>> * Then, we analysed the jvm heap snapshot when the flink job
>>>>>> that has
>>>>>>> no state full operator is running. And according to the
>>>>>> results, leak
>>>>>>> suspect was NetworkBufferPool (image 2)
>>>>>>> *   Last analys, there were both two types of jobs running
>>>>>> and leak
>>>>>>> suspect was again NetworkBufferPool. (image 3)
>>>>>>> In our system jobs are regularly cancelled and resubmitted so
>>>>>> we
>>>>>>> noticed that when job is submitted some amount of memory
>>>>>> allocated and
>>>>>>> after cancelation this allocated memory never freed. So over
>>>>>> time
>>>>>>> memory usage is always increasing and exceeded the limits.
>>>>>> Links:
>>>>>> ------
>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-7845
>>>>>> Hi Piotr,
>>>>>> There are two types of jobs.
>>>>>> In first, we use Kafka source and Kafka sink, there isn't any
>>>>>> window operator.
>>>>>>> In second job, we use Kafka source, filesystem sink and
>>>>>> elastic search sink and window operator for buffering.
>>>>>> Hi Piotrek,
>>>>>> Thanks for your reply.
>>>>>> We've tested our link cluster again. We have 360 slots, and our
>>>>>> cluster configuration is like this;
>>>>>> jobmanager.rpc.address: %JOBMANAGER%
>>>>>> jobmanager.rpc.port: 6123
>>>>>> jobmanager.heap.mb: 1536
>>>>>> taskmanager.heap.mb: 1536
>>>>>> taskmanager.numberOfTaskSlots: 120
>>>>>> taskmanager.memory.preallocate: false
>>>>>> parallelism.default: 1
>>>>>> jobmanager.web.port: 8081
>>>>>> state.backend: filesystem
>>>>>> state.backend.fs.checkpointdir: file:///storage/%CHECKPOINTDIR%
>>>>>> state.checkpoints.dir: file:///storage/%CHECKPOINTDIR%
>>>>>> taskmanager.network.numberOfBuffers: 5000
>>>>>> We are using docker based Flink cluster.
>>>>>> WE submitted 36 jobs with the parallelism of 10. After all slots
>>>>>> became full. Memory usage were increasing by the time and one by =
one
>>>>>> task managers start to die. And the exception was like this;
>>>>>> Taskmanager1 log:
>>>>>> Uncaught error from thread =
[flink-akka.actor.default-dispatcher-17]
>>>>>> shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled =
for
>>>>>> ActorSystem[flink]
>>>>>> java.lang.NoClassDefFoundError:
>>>>>> org/apache/kafka/common/metrics/stats/Rate$1
>>>>>> =E2=80=82=E2=80=82at
>>>>>> org.apache.kafka.common.metrics.stats.Rate.convert(Rate.java:93)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> org.apache.kafka.common.metrics.stats.Rate.measure(Rate.java:62)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.kafka.common.metrics.KafkaMetric.value(KafkaMetric.java:61)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.kafka.common.metrics.KafkaMetric.value(KafkaMetric.java:52)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.streaming.connectors.kafka.internals.metrics.KafkaMetricW=
rapper.getValue(KafkaMetricWrapper.java:35)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.streaming.connectors.kafka.internals.metrics.KafkaMetricW=
rapper.getValue(KafkaMetricWrapper.java:26)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.serializeGau=
ge(MetricDumpSerialization.java:213)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.access$200(M=
etricDumpSerialization.java:50)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization$MetricDumpSe=
rializer.serialize(MetricDumpSerialization.java:138)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricQueryService.onReceive(MetricQ=
ueryService.java:109)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:=
167)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>>> =E2=80=82=E2=80=82at =
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>>>> =E2=80=82=E2=80=82at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractD=
ispatcher.scala:397)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java=
:1339)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.ja=
va:107)
>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>> org.apache.kafka.common.metrics.stats.Rate$1
>>>>>> =E2=80=82=E2=80=82at =
java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>> =E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>> =E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>> =E2=80=82=E2=80=82... 22 more
>>>>>> Taskmanager2 log:
>>>>>> Uncaught error from thread =
[flink-akka.actor.default-dispatcher-17]
>>>>>> shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled =
for
>>>>>> ActorSystem[flink]
>>>>>> Java.lang.NoClassDefFoundError:
>>>>>> =
org/apache/flink/streaming/connectors/kafka/internals/AbstractFetcher$1
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$Offs=
etGauge.getValue(AbstractFetcher.java:492)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$Offs=
etGauge.getValue(AbstractFetcher.java:480)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.serializeGau=
ge(MetricDumpSerialization.java:213)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.access$200(M=
etricDumpSerialization.java:50)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricDumpSerialization$MetricDumpSe=
rializer.serialize(MetricDumpSerialization.java:138)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
org.apache.flink.runtime.metrics.dump.MetricQueryService.onReceive(MetricQ=
ueryService.java:109)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:=
167)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>>> =E2=80=82=E2=80=82at =
akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>>> =E2=80=82=E2=80=82at =
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>>>> =E2=80=82=E2=80=82at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractD=
ispatcher.scala:397)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java=
:1339)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>> =E2=80=82=E2=80=82at
>>>>>> =
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.ja=
va:107)
>>>>>> Caused by: java.lang.ClassNotFoundException:
>>>>>> =
org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$1
>>>>>> =E2=80=82=E2=80=82at =
java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>> =E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>> =E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>> =E2=80=82=E2=80=82... 18 more
>>>>>> -Ebru
>>>>> Hi Piotrek,
>>>>> We attached the full log of the taskmanager1.
>>>>> This may not be a dependency issue because until all of the task =
slots is full, we didn't get any No Class Def Found exception, when =
there is available memory jobs can run without exception for days.
>>>>> Also there is Kafka Instance Already Exist exception in full log, =
but this not relevant and doesn't effect jobs or task managers.
>>>>> -Ebru<taskmanager1.log.zip>
>>> Hi,
>>> Sorry we attached wrong log file. I've attached all task managers =
and job manager's log. All task managers and job manager was =
killed.<logs.zip>
> <logs2-2.zip>


--Apple-Mail=_8241DE43-4FF9-4491-BB43-BD8AE94DB6DD
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html;
	charset=utf-8

<html><head><meta http-equiv=3D"Content-Type" content=3D"text/html; =
charset=3Dutf-8"></head><body style=3D"word-wrap: break-word; =
-webkit-nbsp-mode: space; line-break: after-white-space;" class=3D"">Ebru,=
 Javier, Flavio:<div class=3D""><br class=3D""></div><div class=3D"">I =
tried to reproduce memory leak by submitting a job, that was generating =
classes with random names. And indeed I have found one. Memory was =
accumulating in `char[]` instances that belonged to =
`java.lang.ClassLoader#parallelLockMap`. OldGen memory pool was growing =
in size up to the point I got:<div class=3D""><br class=3D""></div><div =
class=3D"">java.lang.OutOfMemoryError: Java heap space<br class=3D""><div =
class=3D""><br class=3D""></div><div class=3D"">This seems like an old =
known =E2=80=9Cfeature=E2=80=9D of JDK:<div class=3D""><a =
href=3D"https://bugs.openjdk.java.net/browse/JDK-8037342" =
class=3D"">https://bugs.openjdk.java.net/browse/JDK-8037342</a></div><div =
class=3D""><br class=3D""></div><div class=3D"">Can any of you confirm =
that this is the issue that you are experiencing? If not, I would really =
need more help/information from you to track this down.</div><div =
class=3D""><br class=3D""></div><div class=3D"">Piotrek</div><div =
class=3D""><div class=3D""><div><br class=3D""><blockquote type=3D"cite" =
class=3D""><div class=3D"">On 10 Nov 2017, at 15:12, =C3=87ET=C4=B0NKAYA =
EBRU =C3=87ET=C4=B0NKAYA EBRU &lt;<a =
href=3D"mailto:b20926247@cs.hacettepe.edu.tr" =
class=3D"">b20926247@cs.hacettepe.edu.tr</a>&gt; wrote:</div><br =
class=3D"Apple-interchange-newline"><div class=3D""><div class=3D"">On =
2017-11-10 13:14, Piotr Nowojski wrote:<br class=3D""><blockquote =
type=3D"cite" class=3D"">jobmanager1.log and taskmanager2.log are the =
same. Can you also submit<br class=3D"">files containing std output?<br =
class=3D"">Piotrek<br class=3D""><blockquote type=3D"cite" class=3D"">On =
10 Nov 2017, at 09:35, =C3=87ET=C4=B0NKAYA EBRU =C3=87ET=C4=B0NKAYA EBRU =
&lt;<a href=3D"mailto:b20926247@cs.hacettepe.edu.tr" =
class=3D"">b20926247@cs.hacettepe.edu.tr</a>&gt; wrote:<br class=3D"">On =
2017-11-10 11:04, Piotr Nowojski wrote:<br class=3D""><blockquote =
type=3D"cite" class=3D"">Hi,<br class=3D"">Thanks for the logs, however =
I do not see before mentioned exceptions<br class=3D"">in it. It ends =
with java.lang.InterruptedException<br class=3D"">Is it the correct log =
file? Also, could you attach the std output file<br class=3D"">of the =
failing TaskManager?<br class=3D"">Piotrek<br class=3D""><blockquote =
type=3D"cite" class=3D"">On 10 Nov 2017, at 08:42, =C3=87ET=C4=B0NKAYA =
EBRU =C3=87ET=C4=B0NKAYA EBRU &lt;<a =
href=3D"mailto:b20926247@cs.hacettepe.edu.tr" =
class=3D"">b20926247@cs.hacettepe.edu.tr</a>&gt; wrote:<br class=3D"">On =
2017-11-09 20:08, Piotr Nowojski wrote:<br class=3D""><blockquote =
type=3D"cite" class=3D"">Hi,<br class=3D"">Could you attach full logs =
from those task managers? At first glance I<br class=3D"">don=E2=80=99t =
see a connection between those exceptions and any memory issue<br =
class=3D"">that you might had. It looks like a dependency issue in one =
(some?<br class=3D"">All?) of your jobs.<br class=3D"">Did you build =
your jars with -Pbuild-jar profile as described here:<br class=3D""><a =
href=3D"https://ci.apache.org/projects/flink/flink-docs-release-1.3/quicks=
tart/java_api_quickstart.html#build-project" =
class=3D"">https://ci.apache.org/projects/flink/flink-docs-release-1.3/qui=
ckstart/java_api_quickstart.html#build-project</a><br class=3D"">?<br =
class=3D"">If that doesn=E2=80=99t help. Can you binary search which job =
is causing the<br class=3D"">problem? There might be some Flink =
incompatibility between different<br class=3D"">versions and rebuilding =
a job=E2=80=99s jar with a version matching to the<br class=3D"">cluster =
version might help.<br class=3D"">Piotrek<br class=3D""><blockquote =
type=3D"cite" class=3D"">On 9 Nov 2017, at 17:36, =C3=87ET=C4=B0NKAYA =
EBRU =C3=87ET=C4=B0NKAYA EBRU<br =
class=3D"">&lt;b20926247@cs.hacettepe.edu.tr&gt; wrote:<br class=3D"">On =
2017-11-08 18:30, Piotr Nowojski wrote:<br class=3D"">Btw, Ebru:<br =
class=3D"">I don=E2=80=99t agree that the main suspect is =
NetworkBufferPool. On your<br class=3D"">screenshots it=E2=80=99s memory =
consumption was reasonable and stable:<br class=3D"">596MB<br =
class=3D"">-&gt; 602MB -&gt; 597MB.<br class=3D"">PoolThreadCache memory =
usage ~120MB is also reasonable.<br class=3D"">Do you experience any =
problems, like Out Of Memory<br class=3D"">errors/crashes/long<br =
class=3D"">GC pauses? Or just JVM process is using more memory over =
time? You<br class=3D"">are<br class=3D"">aware that JVM doesn=E2=80=99t =
like to release memory back to OS once it<br class=3D"">was<br =
class=3D"">used? So increasing memory usage until hitting some limit =
(for<br class=3D"">example<br class=3D"">JVM max heap size) is expected =
behaviour.<br class=3D"">Piotrek<br class=3D"">On 8 Nov 2017, at 15:48, =
Piotr Nowojski &lt;piotr@data-artisans.com&gt;<br class=3D"">wrote:<br =
class=3D"">I don=E2=80=99t know if this is relevant to this issue, but I =
was<br class=3D"">constantly getting failures trying to reproduce this =
leak using your<br class=3D"">Job, because you were using non =
deterministic getKey function:<br class=3D"">@Override<br =
class=3D"">public Integer getKey(Integer event) {<br class=3D"">Random =
randomGen =3D new Random((new Date()).getTime());<br class=3D"">return =
randomGen.nextInt() % 8;<br class=3D"">}<br class=3D"">And quoting Java =
doc of KeySelector:<br class=3D"">"If invoked multiple times on the same =
object, the returned key must<br class=3D"">be the same.=E2=80=9D<br =
class=3D"">I=E2=80=99m trying to reproduce this issue with following =
job:<br =
class=3D"">https://gist.github.com/pnowojski/b80f725c1af7668051c773438637e=
0d3<br class=3D"">Where IntegerSource is just an infinite source, =
DisardingSink is<br class=3D"">well just discarding incoming data. I=E2=80=
=99m cancelling the job every 5<br class=3D"">seconds and so far (after =
~15 minutes) my memory consumption is<br class=3D"">stable, well below =
maximum java heap size.<br class=3D"">Piotrek<br class=3D"">On 8 Nov =
2017, at 15:28, Javier Lopez &lt;javier.lopez@zalando.de&gt;<br =
class=3D"">wrote:<br class=3D"">Yes, I tested with just printing the =
stream. But it could take a<br class=3D"">lot of time to fail.<br =
class=3D"">On Wednesday, 8 November 2017, Piotr Nowojski<br =
class=3D"">&lt;piotr@data-artisans.com&gt; wrote:<br class=3D"">Thanks =
for quick answer.<br class=3D"">So it will also fail after some time =
with `fromElements` source<br class=3D"">instead of Kafka, right?<br =
class=3D"">Did you try it also without a Kafka producer?<br =
class=3D"">Piotrek<br class=3D"">On 8 Nov 2017, at 14:57, Javier Lopez =
&lt;javier.lopez@zalando.de&gt;<br class=3D"">wrote:<br class=3D"">Hi,<br =
class=3D"">You don't need data. With data it will die faster. I tested =
as<br class=3D"">well with a small data set, using the fromElements =
source, but it<br class=3D"">will take some time to die. It's better =
with some data.<br class=3D"">On 8 November 2017 at 14:54, Piotr =
Nowojski<br class=3D"">&lt;piotr@data-artisans.com&gt; wrote:<br =
class=3D"">Hi,<br class=3D"">Thanks for sharing this job.<br class=3D"">Do=
 I need to feed some data to the Kafka to reproduce this<br =
class=3D""></blockquote>issue with your script?<br class=3D""><blockquote =
type=3D"cite" class=3D""><blockquote type=3D"cite" class=3D"">Does this =
OOM issue also happen when you are not using the<br =
class=3D""></blockquote></blockquote>Kafka source/sink?<br =
class=3D""><blockquote type=3D"cite" class=3D""><blockquote type=3D"cite" =
class=3D"">Piotrek<br class=3D"">On 8 Nov 2017, at 14:08, Javier Lopez =
&lt;javier.lopez@zalando.de&gt;<br =
class=3D""></blockquote></blockquote>wrote:<br class=3D""><blockquote =
type=3D"cite" class=3D""><blockquote type=3D"cite" class=3D"">Hi,<br =
class=3D"">This is the test flink job we created to trigger this leak<br =
class=3D""></blockquote></blockquote>https://gist.github.com/javieredo/c60=
52404dbe6cc602e99f4669a09f7d6<br class=3D""><blockquote type=3D"cite" =
class=3D""><blockquote type=3D"cite" class=3D"">And this is the python =
script we are using to execute the job<br =
class=3D""></blockquote></blockquote>thousands of times to get the OOM =
problem<br =
class=3D"">https://gist.github.com/javieredo/4825324d5d5f504e27ca6c004396a=
107<br class=3D""><blockquote type=3D"cite" class=3D""><blockquote =
type=3D"cite" class=3D"">The cluster we used for this has this =
configuration:<br class=3D"">Instance type: t2.large<br class=3D"">Number =
of workers: 2<br class=3D"">HeapMemory: 5500<br class=3D"">Number of =
task slots per node: 4<br class=3D"">TaskMangMemFraction: 0.5<br =
class=3D"">NumberOfNetworkBuffers: 2000<br class=3D"">We have tried =
several things, increasing the heap, reducing the<br =
class=3D""></blockquote></blockquote>heap, more memory fraction, changes =
this value in the<br class=3D"">taskmanager.sh =
"TM_MAX_OFFHEAP_SIZE=3D"2G"; and nothing seems to<br class=3D"">work.<br =
class=3D""><blockquote type=3D"cite" class=3D""><blockquote type=3D"cite" =
class=3D"">Thanks for your help.<br class=3D"">On 8 November 2017 at =
13:26, =C3=87ET=C4=B0NKAYA EBRU =C3=87ET=C4=B0NKAYA EBRU<br =
class=3D""></blockquote></blockquote>&lt;b20926247@cs.hacettepe.edu.tr&gt;=
 wrote:<br class=3D""><blockquote type=3D"cite" class=3D"">On 2017-11-08 =
15:20, Piotr Nowojski wrote:<br class=3D"">Hi Ebru and Javier,<br =
class=3D"">Yes, if you could share this example job it would be =
helpful.<br class=3D"">Ebru: could you explain in a little more details =
how does<br class=3D""></blockquote>your Job(s)<br class=3D""><blockquote =
type=3D"cite" class=3D"">look like? Could you post some code? If you are =
just using<br class=3D""></blockquote>maps and<br class=3D""><blockquote =
type=3D"cite" class=3D"">filters there shouldn=E2=80=99t be any network =
transfers involved,<br class=3D""></blockquote>aside<br =
class=3D""><blockquote type=3D"cite" class=3D"">from Source and Sink =
functions.<br class=3D"">Piotrek<br class=3D"">On 8 Nov 2017, at 12:54, =
ebru<br class=3D""></blockquote>&lt;b20926247@cs.hacettepe.edu.tr&gt; =
wrote:<br class=3D""><blockquote type=3D"cite" class=3D"">Hi Javier,<br =
class=3D"">It would be helpful if you share your test job with us.<br =
class=3D"">Which configurations did you try?<br class=3D"">-Ebru<br =
class=3D"">On 8 Nov 2017, at 14:43, Javier Lopez<br =
class=3D""></blockquote>&lt;javier.lopez@zalando.de&gt;<br =
class=3D""><blockquote type=3D"cite" class=3D"">wrote:<br =
class=3D"">Hi,<br class=3D"">We have been facing a similar problem. We =
have tried some<br class=3D""></blockquote>different<br =
class=3D""><blockquote type=3D"cite" class=3D"">configurations, as =
proposed in other email thread by Flavio<br class=3D""></blockquote>and<br=
 class=3D""><blockquote type=3D"cite" class=3D"">Kien, but it didn't =
work. We have a workaround similar to<br class=3D""></blockquote>the =
one<br class=3D""><blockquote type=3D"cite" class=3D"">that Flavio has, =
we restart the taskmanagers once they reach<br =
class=3D""></blockquote>a<br class=3D""><blockquote type=3D"cite" =
class=3D"">memory threshold. We created a small test to remove all of<br =
class=3D""></blockquote>our<br class=3D""><blockquote type=3D"cite" =
class=3D"">dependencies and leave only flink native libraries. This<br =
class=3D""></blockquote>test reads<br class=3D""><blockquote type=3D"cite"=
 class=3D"">data from a Kafka topic and writes it back to another =
topic<br class=3D""></blockquote>in<br class=3D""><blockquote =
type=3D"cite" class=3D"">Kafka. We cancel the job and start another =
every 5 seconds.<br class=3D""></blockquote>After<br =
class=3D""><blockquote type=3D"cite" class=3D"">~30 minutes of doing =
this process, the cluster reaches the<br class=3D""></blockquote>OS =
memory<br class=3D""><blockquote type=3D"cite" class=3D"">limit and =
dies.<br class=3D"">Currently, we have a test cluster with 8 workers and =
8 task<br class=3D""></blockquote>slots<br class=3D""><blockquote =
type=3D"cite" class=3D"">per node. We have one job that uses 56 slots, =
and we cannot<br class=3D""></blockquote>execute<br class=3D""><blockquote=
 type=3D"cite" class=3D"">that job 5 times in a row because the whole =
cluster dies. If<br class=3D""></blockquote>you<br class=3D""><blockquote =
type=3D"cite" class=3D"">want, we can publish our test job.<br =
class=3D"">Regards,<br class=3D"">On 8 November 2017 at 11:20, Aljoscha =
Krettek<br class=3D""></blockquote>&lt;aljoscha@apache.org&gt;<br =
class=3D""><blockquote type=3D"cite" class=3D"">wrote:<br class=3D"">@Nico=
 &amp; @Piotr Could you please have a look at this? You<br =
class=3D""></blockquote>both<br class=3D""><blockquote type=3D"cite" =
class=3D"">recently worked on the network stack and might be most<br =
class=3D""></blockquote>familiar with<br class=3D""><blockquote =
type=3D"cite" class=3D"">this.<br class=3D"">On 8. Nov 2017, at 10:25, =
Flavio Pompermaier<br =
class=3D""></blockquote>&lt;pompermaier@okkam.it&gt;<br =
class=3D""><blockquote type=3D"cite" class=3D"">wrote:<br class=3D"">We =
also have the same problem in production. At the moment<br =
class=3D""></blockquote>the<br class=3D""><blockquote type=3D"cite" =
class=3D"">solution is to restart the entire Flink cluster after =
every<br class=3D""></blockquote>job..<br class=3D""><blockquote =
type=3D"cite" class=3D"">We've tried to reproduce this problem with a =
test (see<br class=3D"">https://issues.apache.org/jira/browse/FLINK-7845 =
[1]) but we<br class=3D""></blockquote>don't<br class=3D""><blockquote =
type=3D"cite" class=3D"">know whether the error produced by the test and =
the leak are<br class=3D"">correlated..<br class=3D"">Best,<br =
class=3D"">Flavio<br class=3D"">On Wed, Nov 8, 2017 at 9:51 AM, =
=C3=87ET=C4=B0NKAYA EBRU =C3=87ET=C4=B0NKAYA<br =
class=3D""></blockquote>EBRU<br class=3D""><blockquote type=3D"cite" =
class=3D"">&lt;b20926247@cs.hacettepe.edu.tr&gt; wrote:<br class=3D"">On =
2017-11-07 16:53, Ufuk Celebi wrote:<br class=3D"">Do you use any =
windowing? If yes, could you please share<br class=3D""></blockquote>that =
code?<br class=3D""><blockquote type=3D"cite" class=3D"">If<br =
class=3D"">there is no stateful operation at all, it's strange where<br =
class=3D""></blockquote>the list<br class=3D""><blockquote type=3D"cite" =
class=3D"">state instances are coming from.<br class=3D"">On Tue, Nov 7, =
2017 at 2:35 PM, ebru<br =
class=3D""></blockquote>&lt;b20926247@cs.hacettepe.edu.tr&gt;<br =
class=3D""><blockquote type=3D"cite" class=3D"">wrote:<br class=3D"">Hi =
Ufuk,<br class=3D"">We don=E2=80=99t explicitly define any state =
descriptor. We only<br class=3D""></blockquote>use map<br =
class=3D""><blockquote type=3D"cite" class=3D"">and filters<br =
class=3D"">operator. We thought that gc handle clearing the flink=E2=80=99=
s<br class=3D""></blockquote>internal<br class=3D""><blockquote =
type=3D"cite" class=3D"">states.<br class=3D"">So how can we manage the =
memory if it is always increasing?<br class=3D"">- Ebru<br class=3D"">On =
7 Nov 2017, at 16:23, Ufuk Celebi &lt;uce@apache.org&gt; wrote:<br =
class=3D"">Hey Ebru, the memory usage might be increasing as long as =
a<br class=3D""></blockquote>job is<br class=3D""><blockquote =
type=3D"cite" class=3D"">running.<br class=3D"">This is expected (also =
in the case of multiple running<br class=3D""></blockquote>jobs). The<br =
class=3D""><blockquote type=3D"cite" class=3D"">screenshots are not =
helpful in that regard. :-(<br class=3D"">What kind of stateful =
operations are you using? Depending on<br class=3D""></blockquote>your<br =
class=3D""><blockquote type=3D"cite" class=3D"">use case,<br =
class=3D"">you have to manually call `clear()` on the state instance =
in<br class=3D""></blockquote>order<br class=3D""><blockquote =
type=3D"cite" class=3D"">to<br class=3D"">release the managed state.<br =
class=3D"">Best,<br class=3D"">Ufuk<br class=3D"">On Tue, Nov 7, 2017 at =
12:43 PM, ebru<br class=3D"">&lt;b20926247@cs.hacettepe.edu.tr&gt; =
wrote:<br class=3D"">Begin forwarded message:<br class=3D"">From: ebru =
&lt;b20926247@cs.hacettepe.edu.tr&gt;<br class=3D"">Subject: Re: Flink =
memory leak<br class=3D"">Date: 7 November 2017 at 14:09:17 GMT+3<br =
class=3D"">To: Ufuk Celebi &lt;uce@apache.org&gt;<br class=3D"">Hi =
Ufuk,<br class=3D"">There are there snapshots of htop output.<br =
class=3D"">1. snapshot is initial state.<br class=3D"">2. snapshot is =
after submitted one job.<br class=3D"">3. Snapshot is the output of the =
one job with 15000 EPS. And<br class=3D""></blockquote>the<br =
class=3D""><blockquote type=3D"cite" class=3D"">memory<br class=3D"">usage=
 is always increasing over time.<br =
class=3D"">&lt;1.png&gt;&lt;2.png&gt;&lt;3.png&gt;<br class=3D"">On 7 =
Nov 2017, at 13:34, Ufuk Celebi &lt;uce@apache.org&gt; wrote:<br =
class=3D"">Hey Ebru,<br class=3D"">let me pull in Aljoscha (CC'd) who =
might have an idea what's<br class=3D""></blockquote>causing<br =
class=3D""><blockquote type=3D"cite" class=3D"">this.<br class=3D"">Since =
multiple jobs are running, it will be hard to<br =
class=3D""></blockquote>understand to<br class=3D""><blockquote =
type=3D"cite" class=3D"">which job the state descriptors from the heap =
snapshot<br class=3D""></blockquote>belong to.<br class=3D""><blockquote =
type=3D"cite" class=3D"">- Is it possible to isolate the problem and =
reproduce the<br class=3D""></blockquote>behaviour<br =
class=3D""><blockquote type=3D"cite" class=3D"">with only a single =
job?<br class=3D"">=E2=80=93 Ufuk<br class=3D"">On Tue, Nov 7, 2017 at =
10:27 AM, =C3=87ET=C4=B0NKAYA EBRU<br class=3D""></blockquote>=C3=87ET=C4=B0=
NKAYA EBRU<br class=3D""><blockquote type=3D"cite" =
class=3D"">&lt;b20926247@cs.hacettepe.edu.tr&gt; wrote:<br =
class=3D"">Hi,<br class=3D"">We are using Flink 1.3.1 in production, we =
have one job<br class=3D""></blockquote>manager and<br =
class=3D""><blockquote type=3D"cite" class=3D"">3 task<br =
class=3D"">managers in standalone mode. Recently, we've noticed that =
we<br class=3D""></blockquote>have<br class=3D""><blockquote type=3D"cite"=
 class=3D"">memory<br class=3D"">related problems. We use docker =
container to serve Flink<br class=3D""></blockquote>cluster. We<br =
class=3D""><blockquote type=3D"cite" class=3D"">have<br class=3D"">300 =
slots and 20 jobs are running with parallelism of 10.<br =
class=3D""></blockquote>Also the<br class=3D""><blockquote type=3D"cite" =
class=3D"">job<br class=3D"">count<br class=3D"">may be change over =
time. Taskmanager memory usage always<br =
class=3D""></blockquote>increases.<br class=3D""><blockquote type=3D"cite"=
 class=3D"">After<br class=3D"">job cancelation this memory usage =
doesn't decrease. We've<br class=3D""></blockquote>tried to<br =
class=3D""><blockquote type=3D"cite" class=3D"">investigate the problem =
and we've got the task manager jvm<br class=3D""></blockquote>heap<br =
class=3D""><blockquote type=3D"cite" class=3D"">snapshot.<br =
class=3D"">According to the jam heap analysis, possible memory leak =
was<br class=3D""></blockquote>Flink<br class=3D""><blockquote =
type=3D"cite" class=3D"">list<br class=3D"">state descriptor. But we are =
not sure that is the cause of<br class=3D""></blockquote>our<br =
class=3D""><blockquote type=3D"cite" class=3D"">memory<br =
class=3D"">problem. How can we solve the problem?<br class=3D"">We have =
two types of Flink job. One has no state full<br =
class=3D""></blockquote>operator<br class=3D""><blockquote type=3D"cite" =
class=3D"">contains only maps and filters and the other has time =
window<br class=3D""></blockquote>with<br class=3D""><blockquote =
type=3D"cite" class=3D"">count trigger.<br class=3D"">* We've analysed =
the jvm heaps again in different<br class=3D""></blockquote>conditions. =
First<br class=3D""><blockquote type=3D"cite" class=3D"">we analysed the =
snapshot when no flink jobs running on<br class=3D""></blockquote>cluster.=
 (image<br class=3D""><blockquote type=3D"cite" class=3D"">1)<br =
class=3D"">* Then, we analysed the jvm heap snapshot when the flink =
job<br class=3D""></blockquote>that has<br class=3D""><blockquote =
type=3D"cite" class=3D"">no state full operator is running. And =
according to the<br class=3D""></blockquote>results, leak<br =
class=3D""><blockquote type=3D"cite" class=3D"">suspect was =
NetworkBufferPool (image 2)<br class=3D"">* &nbsp;&nbsp;Last analys, =
there were both two types of jobs running<br class=3D""></blockquote>and =
leak<br class=3D""><blockquote type=3D"cite" class=3D"">suspect was =
again NetworkBufferPool. (image 3)<br class=3D"">In our system jobs are =
regularly cancelled and resubmitted so<br class=3D""></blockquote>we<br =
class=3D""><blockquote type=3D"cite" class=3D"">noticed that when job is =
submitted some amount of memory<br class=3D""></blockquote>allocated =
and<br class=3D""><blockquote type=3D"cite" class=3D"">after cancelation =
this allocated memory never freed. So over<br =
class=3D""></blockquote>time<br class=3D""><blockquote type=3D"cite" =
class=3D"">memory usage is always increasing and exceeded the limits.<br =
class=3D""></blockquote>Links:<br class=3D"">------<br class=3D"">[1] =
https://issues.apache.org/jira/browse/FLINK-7845<br class=3D"">Hi =
Piotr,<br class=3D"">There are two types of jobs.<br class=3D"">In =
first, we use Kafka source and Kafka sink, there isn't any<br =
class=3D"">window operator.<br class=3D""><blockquote type=3D"cite" =
class=3D"">In second job, we use Kafka source, filesystem sink and<br =
class=3D""></blockquote>elastic search sink and window operator for =
buffering.<br class=3D"">Hi Piotrek,<br class=3D"">Thanks for your =
reply.<br class=3D"">We've tested our link cluster again. We have 360 =
slots, and our<br class=3D"">cluster configuration is like this;<br =
class=3D"">jobmanager.rpc.address: %JOBMANAGER%<br =
class=3D"">jobmanager.rpc.port: 6123<br class=3D"">jobmanager.heap.mb: =
1536<br class=3D"">taskmanager.heap.mb: 1536<br =
class=3D"">taskmanager.numberOfTaskSlots: 120<br =
class=3D"">taskmanager.memory.preallocate: false<br =
class=3D"">parallelism.default: 1<br class=3D"">jobmanager.web.port: =
8081<br class=3D"">state.backend: filesystem<br =
class=3D"">state.backend.fs.checkpointdir: =
file:///storage/%CHECKPOINTDIR%<br class=3D"">state.checkpoints.dir: =
file:///storage/%CHECKPOINTDIR%<br =
class=3D"">taskmanager.network.numberOfBuffers: 5000<br class=3D"">We =
are using docker based Flink cluster.<br class=3D"">WE submitted 36 jobs =
with the parallelism of 10. After all slots<br class=3D"">became full. =
Memory usage were increasing by the time and one by one<br class=3D"">task=
 managers start to die. And the exception was like this;<br =
class=3D"">Taskmanager1 log:<br class=3D"">Uncaught error from thread =
[flink-akka.actor.default-dispatcher-17]<br class=3D"">shutting down JVM =
since 'akka.jvm-exit-on-fatal-error' is enabled for<br =
class=3D"">ActorSystem[flink]<br =
class=3D"">java.lang.NoClassDefFoundError:<br =
class=3D"">org/apache/kafka/common/metrics/stats/Rate$1<br =
class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.kafka.common.metrics.stats.Rate.convert(Rate.java:93=
)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.kafka.common.metrics.stats.Rate.measure(Rate.java:62=
)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.kafka.common.metrics.KafkaMetric.value(KafkaMetric.j=
ava:61)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.kafka.common.metrics.KafkaMetric.value(KafkaMetric.j=
ava:52)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.streaming.connectors.kafka.internals.metrics.K=
afkaMetricWrapper.getValue(KafkaMetricWrapper.java:35)<br =
class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.streaming.connectors.kafka.internals.metrics.K=
afkaMetricWrapper.getValue(KafkaMetricWrapper.java:26)<br =
class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.s=
erializeGauge(MetricDumpSerialization.java:213)<br class=3D"">=E2=80=82=E2=
=80=82at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.a=
ccess$200(MetricDumpSerialization.java:50)<br class=3D"">=E2=80=82=E2=80=82=
at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricDumpSerialization$M=
etricDumpSerializer.serialize(MetricDumpSerialization.java:138)<br =
class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricQueryService.onRece=
ive(MetricQueryService.java:109)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedA=
ctor.scala:167)<br class=3D"">=E2=80=82=E2=80=82at =
akka.actor.Actor$class.aroundReceive(Actor.scala:467)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.actor.ActorCell.invoke(ActorCell.scala:487)<br class=3D"">=E2=80=82=E2=
=80=82at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.dispatch.Mailbox.run(Mailbox.scala:220)<br class=3D"">=E2=80=82=E2=80=
=82at<br =
class=3D"">akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exe=
c(AbstractDispatcher.scala:397)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java=
:260)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJo=
inPool.java:1339)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.j=
ava:1979)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWork=
erThread.java:107)<br class=3D"">Caused by: =
java.lang.ClassNotFoundException:<br =
class=3D"">org.apache.kafka.common.metrics.stats.Rate$1<br =
class=3D"">=E2=80=82=E2=80=82at =
java.net.URLClassLoader.findClass(URLClassLoader.java:381)<br =
class=3D"">=E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:424)<br =
class=3D"">=E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:357)<br =
class=3D"">=E2=80=82=E2=80=82... 22 more<br class=3D"">Taskmanager2 =
log:<br class=3D"">Uncaught error from thread =
[flink-akka.actor.default-dispatcher-17]<br class=3D"">shutting down JVM =
since 'akka.jvm-exit-on-fatal-error' is enabled for<br =
class=3D"">ActorSystem[flink]<br =
class=3D"">Java.lang.NoClassDefFoundError:<br =
class=3D"">org/apache/flink/streaming/connectors/kafka/internals/AbstractF=
etcher$1<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.streaming.connectors.kafka.internals.AbstractF=
etcher$OffsetGauge.getValue(AbstractFetcher.java:492)<br =
class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.streaming.connectors.kafka.internals.AbstractF=
etcher$OffsetGauge.getValue(AbstractFetcher.java:480)<br =
class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.s=
erializeGauge(MetricDumpSerialization.java:213)<br class=3D"">=E2=80=82=E2=
=80=82at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.a=
ccess$200(MetricDumpSerialization.java:50)<br class=3D"">=E2=80=82=E2=80=82=
at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricDumpSerialization$M=
etricDumpSerializer.serialize(MetricDumpSerialization.java:138)<br =
class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">org.apache.flink.runtime.metrics.dump.MetricQueryService.onRece=
ive(MetricQueryService.java:109)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedA=
ctor.scala:167)<br class=3D"">=E2=80=82=E2=80=82at =
akka.actor.Actor$class.aroundReceive(Actor.scala:467)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.actor.ActorCell.invoke(ActorCell.scala:487)<br class=3D"">=E2=80=82=E2=
=80=82at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)<br =
class=3D"">=E2=80=82=E2=80=82at =
akka.dispatch.Mailbox.run(Mailbox.scala:220)<br class=3D"">=E2=80=82=E2=80=
=82at<br =
class=3D"">akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exe=
c(AbstractDispatcher.scala:397)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java=
:260)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJo=
inPool.java:1339)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.j=
ava:1979)<br class=3D"">=E2=80=82=E2=80=82at<br =
class=3D"">scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWork=
erThread.java:107)<br class=3D"">Caused by: =
java.lang.ClassNotFoundException:<br =
class=3D"">org.apache.flink.streaming.connectors.kafka.internals.AbstractF=
etcher$1<br class=3D"">=E2=80=82=E2=80=82at =
java.net.URLClassLoader.findClass(URLClassLoader.java:381)<br =
class=3D"">=E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:424)<br =
class=3D"">=E2=80=82=E2=80=82at =
java.lang.ClassLoader.loadClass(ClassLoader.java:357)<br =
class=3D"">=E2=80=82=E2=80=82... 18 more<br class=3D"">-Ebru<br =
class=3D""></blockquote>Hi Piotrek,<br class=3D"">We attached the full =
log of the taskmanager1.<br class=3D"">This may not be a dependency =
issue because until all of the task slots is full, we didn't get any No =
Class Def Found exception, when there is available memory jobs can run =
without exception for days.<br class=3D"">Also there is Kafka Instance =
Already Exist exception in full log, but this not relevant and doesn't =
effect jobs or task managers.<br =
class=3D"">-Ebru&lt;taskmanager1.log.zip&gt;<br =
class=3D""></blockquote></blockquote>Hi,<br class=3D"">Sorry we attached =
wrong log file. I've attached all task managers and job manager's log. =
All task managers and job manager was killed.&lt;logs.zip&gt;<br =
class=3D""></blockquote></blockquote><span =
id=3D"cid:B60F2B45-6212-409F-B4BB-6819130C758E@fritz.box">&lt;logs2-2.zip&=
gt;</span></div></div></blockquote></div><br =
class=3D""></div></div></div></div></div></body></html>=

--Apple-Mail=_8241DE43-4FF9-4491-BB43-BD8AE94DB6DD--