From: Piotr Nowojski <piotr@data-artisans.com>
Subject: Re: Flink memory leak
Date: Tue, 14 Nov 2017 17:29:28 +0100
To: Flavio Pompermaier <pompermaier@okkam.it>
Cc: ÇETİNKAYA EBRU <b20926247@cs.hacettepe.edu.tr>, Javier Lopez <javier.lopez@zalando.de>, Aljoscha Krettek <aljoscha@apache.org>, Nico Kruber, Ufuk Celebi <uce@apache.org>, user@flink.apache.org

Best would be to analyse memory usage via some profiler. What I have done was:

1. Run your scenario on the test cluster until memory consumption goes up.
2. Stop submitting new jobs and cancel all running jobs.
3. Manually trigger GC a couple of times via jconsole (other tools can do that as well).
4. Analyse memory consumption via:

A) Oracle's Mission Control (Java Mission Control, jmc)
   - analyse memory consumption and check which memory pool is growing (OldGen heap? Metaspace? Code Cache? Non-heap?)
   - run a flight recording with all memory options checked
   - check which objects were using a lot of memory
B) VisualVM
   - take a heap dump and analyse what is using up all of this memory
C) jconsole
   - it can tell you the memory pool status of your JVMs, but it will not tell you which objects are actually exhausting the pools

A couple of remarks:
- Because of GC, memory usage can go up and down. What matters is the trend of the local minimums measured just after a manually triggered GC.
- You might have to repeat steps 2, 3 and 4 to actually see what has increased between job submissions.
- By default, network buffers use 10% of the heap space as byte[], so you can ignore those.
- The JDK bug that I reproduced showed up as huge memory consumption by many char[] and ConcurrentHashMap$Node instances.
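If you prefer to script steps 3 and 4, here is a minimal sketch of doing them from inside the JVM, assuming a HotSpot JVM (the class name and the dump file name are only examples):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapInspection {
    public static void main(String[] args) throws Exception {
        // Step 3: request a full GC so the numbers below reflect live objects only.
        System.gc();

        // Step 4: print per-pool usage (OldGen, Metaspace, Code Cache, ...) to spot the growing pool.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.printf("%-35s used = %d bytes%n", pool.getName(), pool.getUsage().getUsed());
        }

        // Write a heap dump that can be opened with VisualVM or Mission Control.
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        diagnostics.dumpHeap("taskmanager-heap.hprof", true); // true = only live objects
    }
}

Running it a couple of times after step 2 and comparing the OldGen numbers gives the same trend as the local minimums mentioned above.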
Piotrek

> On 14 Nov 2017, at 16:08, Flavio Pompermaier <pompermaier@okkam.it> wrote:
>
> What should we do to confirm it? Do you have any github repo to start from?
>
> On Tue, Nov 14, 2017 at 4:02 PM, Piotr Nowojski <piotr@data-artisans.com> wrote:
> Ebru, Javier, Flavio:
>
> I tried to reproduce the memory leak by submitting a job that was generating classes with random names, and indeed I have found one. Memory was accumulating in `char[]` instances that belonged to `java.lang.ClassLoader#parallelLockMap`. The OldGen memory pool was growing in size up to the point where I got:
>
> java.lang.OutOfMemoryError: Java heap space
>
> This seems like an old, known "feature" of the JDK:
> https://bugs.openjdk.java.net/browse/JDK-8037342
>
> Can any of you confirm that this is the issue you are experiencing? If not, I would really need more help/information from you to track this down.
>
> Piotrek
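A rough way to check whether you are hitting the same parallelLockMap growth, without taking a full heap dump, is to look at the map via reflection. This is only a sketch and assumes a JDK 7/8 class library where reflective access to java.lang.ClassLoader internals is still permitted:

import java.lang.reflect.Field;
import java.util.Map;

public class ParallelLockMapCheck {

    // Returns the number of entries held by the loader's parallelLockMap,
    // or -1 if the field is not present in this JVM.
    public static int entryCount(ClassLoader loader) {
        try {
            Field field = ClassLoader.class.getDeclaredField("parallelLockMap");
            field.setAccessible(true);
            Map<?, ?> lockMap = (Map<?, ?>) field.get(loader);
            return lockMap == null ? 0 : lockMap.size();
        } catch (ReflectiveOperationException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("parallelLockMap entries: "
                + entryCount(ParallelLockMapCheck.class.getClassLoader()));
    }
}

If that count keeps growing across job submissions and cancellations, you are most likely seeing the JDK-8037342 behaviour described above.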
>> On 10 Nov 2017, at 15:12, ÇETİNKAYA EBRU <b20926247@cs.hacettepe.edu.tr> wrote:
>>
>> On 2017-11-10 13:14, Piotr Nowojski wrote:
>>> jobmanager1.log and taskmanager2.log are the same. Can you also submit
>>> the files containing the std output?
>>> Piotrek
>>>> On 10 Nov 2017, at 09:35, ÇETİNKAYA EBRU <b20926247@cs.hacettepe.edu.tr> wrote:
>>>> On 2017-11-10 11:04, Piotr Nowojski wrote:
>>>>> Hi,
>>>>> Thanks for the logs, however I do not see the previously mentioned exceptions
>>>>> in it. It ends with a java.lang.InterruptedException.
>>>>> Is it the correct log file? Also, could you attach the std output file
>>>>> of the failing TaskManager?
>>>>> Piotrek
>>>>>> On 10 Nov 2017, at 08:42, ÇETİNKAYA EBRU <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>> On 2017-11-09 20:08, Piotr Nowojski wrote:
>>>>>>> Hi,
>>>>>>> Could you attach the full logs from those task managers? At first glance I
>>>>>>> don't see a connection between those exceptions and any memory issue
>>>>>>> that you might have had. It looks like a dependency issue in one (some?
>>>>>>> all?) of your jobs.
>>>>>>> Did you build your jars with the -Pbuild-jar profile as described here:
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.3/quickstart/java_api_quickstart.html#build-project
>>>>>>> ?
>>>>>>> If that doesn't help, can you binary search which job is causing the
>>>>>>> problem? There might be some Flink incompatibility between different
>>>>>>> versions, and rebuilding a job's jar with a version matching the
>>>>>>> cluster version might help.
>>>>>>> Piotrek
>>>>>>>> On 9 Nov 2017, at 17:36, ÇETİNKAYA EBRU
>>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>>> On 2017-11-08 18:30, Piotr Nowojski wrote:
>>>>>>>> Btw, Ebru:
>>>>>>>> I don't agree that the main suspect is NetworkBufferPool. On your
>>>>>>>> screenshots its memory consumption was reasonable and stable: 596MB
>>>>>>>> -> 602MB -> 597MB.
>>>>>>>> PoolThreadCache memory usage of ~120MB is also reasonable.
>>>>>>>> Do you experience any problems, like Out Of Memory errors/crashes/long
>>>>>>>> GC pauses? Or is the JVM process just using more memory over time? You
>>>>>>>> are aware that the JVM doesn't like to release memory back to the OS once
>>>>>>>> it has been used? So increasing memory usage until hitting some limit (for
>>>>>>>> example the JVM max heap size) is expected behaviour.
>>>>>>>> Piotrek
>>>>>>>> On 8 Nov 2017, at 15:48, Piotr Nowojski <piotr@data-artisans.com>
>>>>>>>> wrote:
>>>>>>>> I don't know if this is relevant to this issue, but I was
>>>>>>>> constantly getting failures trying to reproduce this leak using your
>>>>>>>> job, because you were using a non-deterministic getKey function:
>>>>>>>> @Override
>>>>>>>> public Integer getKey(Integer event) {
>>>>>>>>     Random randomGen = new Random((new Date()).getTime());
>>>>>>>>     return randomGen.nextInt() % 8;
>>>>>>>> }
>>>>>>>> And quoting the Java doc of KeySelector:
>>>>>>>> "If invoked multiple times on the same object, the returned key must
>>>>>>>> be the same."
>>>>>>>> I'm trying to reproduce this issue with the following job:
>>>>>>>> https://gist.github.com/pnowojski/b80f725c1af7668051c773438637e0d3
>>>>>>>> where IntegerSource is just an infinite source and DiscardingSink is,
>>>>>>>> well, just discarding incoming data. I'm cancelling the job every 5
>>>>>>>> seconds and so far (after ~15 minutes) my memory consumption is
>>>>>>>> stable, well below the maximum java heap size.
>>>>>>>> Piotrek
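For reference, a deterministic KeySelector for a job like the one above could look like the following sketch (the modulo base of 8 is kept from the snippet quoted above; Math.floorMod avoids negative keys):

import org.apache.flink.api.java.functions.KeySelector;

// Deterministic replacement for the random getKey() quoted above: the same
// event always maps to the same key, as the KeySelector contract requires.
public class ModuloKeySelector implements KeySelector<Integer, Integer> {
    @Override
    public Integer getKey(Integer event) {
        return Math.floorMod(event, 8);
    }
}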
>>>>>>>> On 8 Nov 2017, at 15:28, Javier Lopez <javier.lopez@zalando.de> wrote:
>>>>>>>> Yes, I tested with just printing the stream. But it could take a
>>>>>>>> lot of time to fail.
>>>>>>>> On Wednesday, 8 November 2017, Piotr Nowojski
>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>> Thanks for the quick answer.
>>>>>>>> So it will also fail after some time with the `fromElements` source
>>>>>>>> instead of Kafka, right?
>>>>>>>> Did you try it also without a Kafka producer?
>>>>>>>> Piotrek
>>>>>>>> On 8 Nov 2017, at 14:57, Javier Lopez <javier.lopez@zalando.de>
>>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> You don't need data. With data it will die faster. I tested as
>>>>>>>> well with a small data set, using the fromElements source, but it
>>>>>>>> will take some time to die. It's better with some data.
>>>>>>>> On 8 November 2017 at 14:54, Piotr Nowojski
>>>>>>>> <piotr@data-artisans.com> wrote:
>>>>>>>> Hi,
>>>>>>>> Thanks for sharing this job.
>>>>>>>> Do I need to feed some data to Kafka to reproduce this
>>>>>>>> issue with your script?
>>>>>>>> Does this OOM issue also happen when you are not using the
>>>>>>>> Kafka source/sink?
>>>>>>>> Piotrek
>>>>>>>> On 8 Nov 2017, at 14:08, Javier Lopez <javier.lopez@zalando.de>
>>>>>>>> wrote:
>>>>>>>> Hi,
>>>>>>>> This is the test flink job we created to trigger this leak:
>>>>>>>> https://gist.github.com/javieredo/c6052404dbe6cc602e99f4669a09f7d6
>>>>>>>> And this is the python script we are using to execute the job
>>>>>>>> thousands of times to get the OOM problem:
>>>>>>>> https://gist.github.com/javieredo/4825324d5d5f504e27ca6c004396a107
>>>>>>>> The cluster we used for this has this configuration:
>>>>>>>> Instance type: t2.large
>>>>>>>> Number of workers: 2
>>>>>>>> HeapMemory: 5500
>>>>>>>> Number of task slots per node: 4
>>>>>>>> TaskMangMemFraction: 0.5
>>>>>>>> NumberOfNetworkBuffers: 2000
>>>>>>>> We have tried several things: increasing the heap, reducing the
>>>>>>>> heap, more memory fraction, changing this value in taskmanager.sh:
>>>>>>>> TM_MAX_OFFHEAP_SIZE="2G"; and nothing seems to work.
>>>>>>>> Thanks for your help.
>>>>>>>> On 8 November 2017 at 13:26, ÇETİNKAYA EBRU
>>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>>> On 2017-11-08 15:20, Piotr Nowojski wrote:
>>>>>>>> Hi Ebru and Javier,
>>>>>>>> Yes, if you could share this example job it would be helpful.
>>>>>>>> Ebru: could you explain in a little more detail what your job(s)
>>>>>>>> look like? Could you post some code? If you are just using maps and
>>>>>>>> filters there shouldn't be any network transfers involved, aside
>>>>>>>> from the Source and Sink functions.
>>>>>>>> Piotrek
>>>>>>>> On 8 Nov 2017, at 12:54, ebru
>>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>>> Hi Javier,
>>>>>>>> It would be helpful if you share your test job with us.
>>>>>>>> Which configurations did you try?
>>>>>>>> -Ebru
>>>>>>>> On 8 Nov 2017, at 14:43, Javier Lopez
>>>>>>>> <javier.lopez@zalando.de> wrote:
>>>>>>>> Hi,
>>>>>>>> We have been facing a similar problem. We have tried some different
>>>>>>>> configurations, as proposed in the other email thread by Flavio and
>>>>>>>> Kien, but it didn't work. We have a workaround similar to the one
>>>>>>>> that Flavio has: we restart the taskmanagers once they reach a
>>>>>>>> memory threshold. We created a small test to remove all of our
>>>>>>>> dependencies and leave only flink native libraries. This test reads
>>>>>>>> data from a Kafka topic and writes it back to another topic in
>>>>>>>> Kafka. We cancel the job and start another every 5 seconds. After
>>>>>>>> ~30 minutes of doing this process, the cluster reaches the OS memory
>>>>>>>> limit and dies.
>>>>>>>> Currently, we have a test cluster with 8 workers and 8 task slots
>>>>>>>> per node. We have one job that uses 56 slots, and we cannot execute
>>>>>>>> that job 5 times in a row because the whole cluster dies. If you
>>>>>>>> want, we can publish our test job.
>>>>>>>> Regards,
>>>>>>>> On 8 November 2017 at 11:20, Aljoscha Krettek
>>>>>>>> <aljoscha@apache.org> wrote:
>>>>>>>> @Nico & @Piotr Could you please have a look at this? You both
>>>>>>>> recently worked on the network stack and might be most familiar with
>>>>>>>> this.
>>>>>>>> On 8. Nov 2017, at 10:25, Flavio Pompermaier
>>>>>>>> <pompermaier@okkam.it> wrote:
>>>>>>>> We also have the same problem in production. At the moment the
>>>>>>>> solution is to restart the entire Flink cluster after every job.
>>>>>>>> We've tried to reproduce this problem with a test (see
>>>>>>>> https://issues.apache.org/jira/browse/FLINK-7845 [1]) but we don't
>>>>>>>> know whether the error produced by the test and the leak are
>>>>>>>> correlated.
>>>>>>>> Best,
>>>>>>>> Flavio
>>>>>>>> On Wed, Nov 8, 2017 at 9:51 AM, ÇETİNKAYA EBRU
>>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>>> On 2017-11-07 16:53, Ufuk Celebi wrote:
>>>>>>>> Do you use any windowing? If yes, could you please share that code?
>>>>>>>> If there is no stateful operation at all, it's strange where the
>>>>>>>> list state instances are coming from.
>>>>>>>> On Tue, Nov 7, 2017 at 2:35 PM, ebru
>>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>>> Hi Ufuk,
>>>>>>>> We don't explicitly define any state descriptor. We only use map
>>>>>>>> and filter operators. We thought that GC handles clearing Flink's
>>>>>>>> internal states.
>>>>>>>> So how can we manage the memory if it is always increasing?
>>>>>>>> - Ebru
>>>>>>>> On 7 Nov 2017, at 16:23, Ufuk Celebi <uce@apache.org> wrote:
>>>>>>>> Hey Ebru, the memory usage might be increasing as long as a job is
>>>>>>>> running.
>>>>>>>> This is expected (also in the case of multiple running jobs). The
>>>>>>>> screenshots are not helpful in that regard. :-(
>>>>>>>> What kind of stateful operations are you using? Depending on your
>>>>>>>> use case, you have to manually call `clear()` on the state instance
>>>>>>>> in order to release the managed state.
>>>>>>>> Best,
>>>>>>>> Ufuk
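As a concrete, illustrative example of what calling `clear()` looks like for keyed managed state (the class name, field names and the threshold of 100 below are made up, not taken from the actual jobs):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// A keyed function that accumulates a per-key count and releases the
// managed state with clear() once it has emitted a result.
public class CountAndClear extends RichFlatMapFunction<Integer, Integer> {

    private transient ValueState<Integer> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Integer.class));
    }

    @Override
    public void flatMap(Integer value, Collector<Integer> out) throws Exception {
        Integer current = count.value();
        int updated = (current == null ? 0 : current) + 1;
        if (updated >= 100) {
            out.collect(updated);
            count.clear();           // release the managed state for this key
        } else {
            count.update(updated);
        }
    }
}

Without the `clear()` call, the state for each key stays referenced by the state backend until the job is cancelled.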
>>>>>>>> On Tue, Nov 7, 2017 at 12:43 PM, ebru
>>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>>> Begin forwarded message:
>>>>>>>> From: ebru <b20926247@cs.hacettepe.edu.tr>
>>>>>>>> Subject: Re: Flink memory leak
>>>>>>>> Date: 7 November 2017 at 14:09:17 GMT+3
>>>>>>>> To: Ufuk Celebi <uce@apache.org>
>>>>>>>> Hi Ufuk,
>>>>>>>> There are three snapshots of htop output.
>>>>>>>> 1. snapshot is the initial state.
>>>>>>>> 2. snapshot is after one job was submitted.
>>>>>>>> 3. snapshot is the output of the one job with 15000 EPS. And the
>>>>>>>> memory usage is always increasing over time.
>>>>>>>> <1.png><2.png><3.png>
>>>>>>>> On 7 Nov 2017, at 13:34, Ufuk Celebi <uce@apache.org> wrote:
>>>>>>>> Hey Ebru,
>>>>>>>> let me pull in Aljoscha (CC'd) who might have an idea what's causing
>>>>>>>> this.
>>>>>>>> Since multiple jobs are running, it will be hard to understand to
>>>>>>>> which job the state descriptors from the heap snapshot belong.
>>>>>>>> - Is it possible to isolate the problem and reproduce the behaviour
>>>>>>>> with only a single job?
>>>>>>>> – Ufuk
>>>>>>>> On Tue, Nov 7, 2017 at 10:27 AM, ÇETİNKAYA EBRU
>>>>>>>> <b20926247@cs.hacettepe.edu.tr> wrote:
>>>>>>>> Hi,
>>>>>>>> We are using Flink 1.3.1 in production; we have one job manager and
>>>>>>>> 3 task managers in standalone mode. Recently, we've noticed that we
>>>>>>>> have memory related problems. We use a docker container to serve the
>>>>>>>> Flink cluster. We have 300 slots and 20 jobs are running with a
>>>>>>>> parallelism of 10. Also the job count may change over time.
>>>>>>>> Taskmanager memory usage always increases. After job cancellation
>>>>>>>> this memory usage doesn't decrease. We've tried to investigate the
>>>>>>>> problem and we've got the task manager JVM heap snapshot. According
>>>>>>>> to the heap analysis, the possible memory leak was Flink's list
>>>>>>>> state descriptor. But we are not sure that is the cause of our
>>>>>>>> memory problem. How can we solve the problem?
>>>>>>>> We have two types of Flink job. One has no stateful operator and
>>>>>>>> contains only maps and filters; the other has a time window with a
>>>>>>>> count trigger.
>>>>>>>> * We've analysed the JVM heaps again in different conditions. First
>>>>>>>> we analysed the snapshot when no Flink jobs were running on the
>>>>>>>> cluster. (image 1)
>>>>>>>> * Then, we analysed the JVM heap snapshot when the Flink job that has
>>>>>>>> no stateful operator was running. According to the results, the leak
>>>>>>>> suspect was NetworkBufferPool. (image 2)
>>>>>>>> * In the last analysis, both types of jobs were running and the leak
>>>>>>>> suspect was again NetworkBufferPool. (image 3)
>>>>>>>> In our system jobs are regularly cancelled and resubmitted, and we
>>>>>>>> noticed that when a job is submitted some amount of memory is
>>>>>>>> allocated, and after cancellation this allocated memory is never
>>>>>>>> freed. So over time memory usage is always increasing and exceeds
>>>>>>>> the limits.
>>>>>>>> Links:
>>>>>>>> ------
>>>>>>>> [1] https://issues.apache.org/jira/browse/FLINK-7845
>>>>>>> Hi Piotr,
>>>>>>> There are two types of jobs.
>>>>>>> In the first, we use a Kafka source and a Kafka sink; there isn't any
>>>>>>> window operator.
>>>>>>> In the second job, we use a Kafka source, a filesystem sink and an
>>>>>>> elasticsearch sink, and a window operator for buffering.
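For the second type of job (a time window with a count trigger), a stripped-down sketch of that pattern, using the 1.3-era DataStream API, looks roughly like this; the source, key, window size and trigger count are placeholders, not the actual job:

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public class WindowWithCountTriggerJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)
                .keyBy(new KeySelector<Integer, Integer>() {
                    @Override
                    public Integer getKey(Integer value) {
                        return Math.floorMod(value, 2); // deterministic key
                    }
                })
                .timeWindow(Time.seconds(10))     // time window, but ...
                .trigger(CountTrigger.of(4))      // ... fired every 4 elements per key
                .reduce(new ReduceFunction<Integer>() {
                    @Override
                    public Integer reduce(Integer a, Integer b) {
                        return a + b;
                    }
                })
                .print();

        env.execute("window-with-count-trigger");
    }
}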
>>>>>>> Hi Piotrek,
>>>>>>> Thanks for your reply.
>>>>>>> We've tested our Flink cluster again. We have 360 slots, and our
>>>>>>> cluster configuration is like this:
>>>>>>> jobmanager.rpc.address: %JOBMANAGER%
>>>>>>> jobmanager.rpc.port: 6123
>>>>>>> jobmanager.heap.mb: 1536
>>>>>>> taskmanager.heap.mb: 1536
>>>>>>> taskmanager.numberOfTaskSlots: 120
>>>>>>> taskmanager.memory.preallocate: false
>>>>>>> parallelism.default: 1
>>>>>>> jobmanager.web.port: 8081
>>>>>>> state.backend: filesystem
>>>>>>> state.backend.fs.checkpointdir: file:///storage/%CHECKPOINTDIR%
>>>>>>> state.checkpoints.dir: file:///storage/%CHECKPOINTDIR%
>>>>>>> taskmanager.network.numberOfBuffers: 5000
>>>>>>> We are using a docker based Flink cluster.
>>>>>>> We submitted 36 jobs with a parallelism of 10. After all slots became
>>>>>>> full, memory usage kept increasing over time and one by one the task
>>>>>>> managers started to die. The exception was like this:
>>>>>>> Taskmanager1 log:
>>>>>>> Uncaught error from thread [flink-akka.actor.default-dispatcher-17]
>>>>>>> shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for
>>>>>>> ActorSystem[flink]
>>>>>>> java.lang.NoClassDefFoundError: org/apache/kafka/common/metrics/stats/Rate$1
>>>>>>>   at org.apache.kafka.common.metrics.stats.Rate.convert(Rate.java:93)
>>>>>>>   at org.apache.kafka.common.metrics.stats.Rate.measure(Rate.java:62)
>>>>>>>   at org.apache.kafka.common.metrics.KafkaMetric.value(KafkaMetric.java:61)
>>>>>>>   at org.apache.kafka.common.metrics.KafkaMetric.value(KafkaMetric.java:52)
>>>>>>>   at org.apache.flink.streaming.connectors.kafka.internals.metrics.KafkaMetricWrapper.getValue(KafkaMetricWrapper.java:35)
>>>>>>>   at org.apache.flink.streaming.connectors.kafka.internals.metrics.KafkaMetricWrapper.getValue(KafkaMetricWrapper.java:26)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.serializeGauge(MetricDumpSerialization.java:213)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.access$200(MetricDumpSerialization.java:50)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricDumpSerialization$MetricDumpSerializer.serialize(MetricDumpSerialization.java:138)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricQueryService.onReceive(MetricQueryService.java:109)
>>>>>>>   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
>>>>>>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>>>>>>   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
>>>>>>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>>>>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>>>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>>>>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>>>>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.kafka.common.metrics.stats.Rate$1
>>>>>>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>>   ... 22 more
>>>>>>> Taskmanager2 log:
>>>>>>> Uncaught error from thread [flink-akka.actor.default-dispatcher-17]
>>>>>>> shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for
>>>>>>> ActorSystem[flink]
>>>>>>> java.lang.NoClassDefFoundError: org/apache/flink/streaming/connectors/kafka/internals/AbstractFetcher$1
>>>>>>>   at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$OffsetGauge.getValue(AbstractFetcher.java:492)
>>>>>>>   at org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$OffsetGauge.getValue(AbstractFetcher.java:480)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.serializeGauge(MetricDumpSerialization.java:213)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricDumpSerialization.access$200(MetricDumpSerialization.java:50)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricDumpSerialization$MetricDumpSerializer.serialize(MetricDumpSerialization.java:138)
>>>>>>>   at org.apache.flink.runtime.metrics.dump.MetricQueryService.onReceive(MetricQueryService.java:109)
>>>>>>>   at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
>>>>>>>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>>>>>>>   at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
>>>>>>>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>>>>>>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>>>>>>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>>>>>>>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>>>>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>>> Caused by: java.lang.ClassNotFoundException: org.apache.flink.streaming.connectors.kafka.internals.AbstractFetcher$1
>>>>>>>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>>>>>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>>>>>>>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>>>>>   ... 18 more
>>>>>>> -Ebru
>>>>>> Hi Piotrek,
>>>>>> We attached the full log of taskmanager1.
>>>>>> This may not be a dependency issue, because until all of the task slots
>>>>>> are full we don't get any NoClassDefFoundError; when there is available
>>>>>> memory, jobs can run without exceptions for days.
>>>>>> Also there is a "Kafka instance already exists" exception in the full
>>>>>> log, but this is not relevant and doesn't affect the jobs or task
>>>>>> managers.
>>>>>> -Ebru<taskmanager1.log.zip>
>>>> Hi,
>>>> Sorry, we attached the wrong log file. I've attached all task managers'
>>>> and the job manager's logs. All task managers and the job manager were
>>>> killed.<logs.zip>
>> <logs2-2.zip>
>
> --
> Flavio Pompermaier
> Development Department
>
> OKKAM S.r.l.
> Tel. +(39) 0461 041809