From: Piotr Nowojski
Subject: Re: Off heap memory issue
Date: Wed, 15 Nov 2017 13:18:13 +0100
To: Flavio Pompermaier
Cc: Kien Truong, Javier Lopez, Robert Metzger, user@flink.apache.org

Hi,

I have been able to observe some off heap memory “issues” by submitting the Kafka job provided by Javier Lopez (in a different mailing thread).

TL;DR:

There was no memory leak; the memory pools “Metaspace” and “Compressed Class Space” just grow in size over time and are only rarely garbage collected. In my test case they together were wasting up to ~7GB of memory, while the same job could run with as little as ~100MB. Connect to your JVM with, for example, jconsole, check their sizes, and cut them in half by setting:

env.java.opts: -XX:CompressedClassSpaceSize=***M -XX:MaxMetaspaceSize=***M

in flink-conf.yaml. Everything still works fine but memory consumption is still too high? Rinse and repeat.
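
If you would rather log those two pools than watch them in jconsole, here is a minimal sketch using the standard java.lang.management beans (the pool names assume a HotSpot 8 JVM; run it inside the TaskManager JVM or read the same MBeans remotely):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Prints the current size of the two non-heap pools discussed above.
public class NonHeapPoolCheck {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.equals("Metaspace") || name.equals("Compressed Class Space")) {
                MemoryUsage usage = pool.getUsage();
                // getMax() is -1 when the pool is unbounded (the Metaspace default).
                String max = usage.getMax() < 0 ? "unlimited" : (usage.getMax() >> 20) + "MB";
                System.out.printf("%s: used=%dMB committed=%dMB max=%s%n",
                        name, usage.getUsed() >> 20, usage.getCommitted() >> 20, max);
            }
        }
    }
}

Calling this periodically shows whether the pools keep growing across job submissions or level off once the caps above are in place.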


Long story:

With default settings and a max heap size of 1GB, the consumption of the non-heap memory pools “Metaspace” and “Compressed Class Space” was growing over time, seemingly indefinitely, and Metaspace was always around ~6 times larger than Compressed Class Space. The default max Metaspace size is unlimited, while “Compressed Class Space” has a default max size of 1GB.

When I decreased CompressedClassSpaceSize to 100MB, its consumption grew to ~90MB and then started bouncing up and down by a couple of MB. “Metaspace” followed the same pattern, but around ~600MB. When I then decreased MaxMetaspaceSize to 200MB, the combined consumption of both pools bounced around ~220MB.

There seem to be no general guidelines for how to configure those values, since this is heavily application dependent. However, this looks like the most likely suspect for the apparent off-heap “memory leak” that has been reported a couple of times in use cases where users submit hundreds or thousands of jobs to a Flink cluster. For more information please check:

https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/considerations.html
Please let us know if this solves your issues.

Thanks,
Piotrek

On 13 Nov 2017, at 16:06, Flavio Pompermaier <pompermaier@okkam.it> wrote:

Unfortunately the issue I've opened [1] was not a problem of Flink but was just caused by an ever-increasing job plan.
So no help from that... Let's hope to find out the real source of the problem.
Maybe using -Djdk.nio.maxCachedBufferSize could help (but I didn't try it yet).

Best,
Flavio

[1] https://issues.apache.org/jira/browse/FLINK-7845

On Wed, Oct 18, 2017 at 2:07 PM, Kien Truong <duckientruong@gmail.com> wrote:

Hi,

We saw a similar issue in one of our jobs due to a ByteBuffer memory leak [1].

We fixed it using the solution in the article, setting -Djdk.nio.maxCachedBufferSize.

This variable is available from Java 8u102 onwards.
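
For anyone curious about the mechanism described in [1]: writing a large heap ByteBuffer through a channel makes the JDK copy it into a temporary direct buffer of the same size, and that direct buffer is cached per thread and kept for the lifetime of the thread. A rough, self-contained illustration (not Flink code, the temp file is just an example):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TempDirectBufferDemo {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("buffer-demo", ".bin");
        // A 64MB *heap* buffer...
        ByteBuffer heapBuffer = ByteBuffer.allocate(64 * 1024 * 1024);
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
            // ...written through a channel: internally the JDK copies it into a
            // temporary *direct* buffer of the same size and caches that buffer
            // in a thread-local, so ~64MB of native memory stays allocated as
            // long as the writing thread lives.
            channel.write(heapBuffer);
        }
        Files.delete(tmp);
    }
}

Starting the JVM with, for example, -Djdk.nio.maxCachedBufferSize=262144 keeps buffers above that size from being cached (the exact value depends on your I/O patterns).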

Best regards,

Kien

[1] http://www.evanjones.ca/java-bytebuffer-leak.html


On 10/18/2017 4:06 PM, Flavio Pompermaier wrote:
We also faced the same problem, but the number of jobs we can run before restarting the cluster depends on the volume of data shuffled around the network. We even had problems with a single job, and in order to avoid OOM issues we had to add some configuration to limit Netty memory usage, i.e.:
 - Add to flink-conf.yaml -> env.java.opts: -Dio.netty.recycler.maxCapacity.default=1
 - Edit taskmanager.sh and change TM_MAX_OFFHEAP_SIZE from 8388607T to 5g

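For reference, the two changes together look roughly like this (5g is just the value that worked for us and, if I read the 1.3.x taskmanager.sh correctly, that variable ends up as the TaskManager's -XX:MaxDirectMemorySize):

in flink-conf.yaml:
env.java.opts: -Dio.netty.recycler.maxCapacity.default=1

in taskmanager.sh:
TM_MAX_OFFHEAP_SIZE="5g"   # default is "8388607T"
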
To reproduce the problem we wrote a small test and opened an issue for it [1].
We still don't know whether the problems are related, however.

I hope this is helpful,
Flavio

[1] https://issues.apache.org/jira/browse/FLINK-7845

On Wed, Oct 18, 2017 at 10:48 AM, Javier Lopez <javier.lopez@zalando.de> wrote:
Hi Robert,

Sorry for the late reply. We ran a lot of tests trying to identify whether the problem was in our custom sources/sinks, and figured out that none of our custom components causes this problem. We came up with a small test and realized that the Flink nodes run out of non-heap JVM memory and crash after the deployment of thousands of jobs.

When rapidly deploying thousands or hundreds of thousands of Flink jobs (the exact number depends on job complexity in terms of resource consumption), the Flink nodes' non-heap JVM memory consumption grows until there is no memory left on the machine and the Flink process crashes. Both the TaskManagers and the JobManager exhibit the same behavior, though the TaskManagers die faster. The memory consumption does not decrease after stopping the deployment of new jobs, even with the cluster idle (no running jobs).

We could replicate the behavior by rapidly deploying the WordCount job from the Quickstart with a Python script. We started 24 instances of the deployment script in parallel.

The non-heap JVM memory consumption grows faster with more complex jobs, e.g. reading 10K events from Kafka and printing them to STDOUT (*), so fewer deployed jobs are needed before the TaskManagers/JobManager die.

We employ Flink 1.3.2 in standalone mode on AWS EC2 t2.large nodes with 4GB RAM inside Docker containers. For the test, we used 2 TaskManagers and 1 JobManager.

(*) A slightly modified Python script was used, which waited 15 seconds after deployment for the 10K events to be read from Kafka and then canceled the freshly deployed job via the Flink REST API.

If you want, we can provide the scripts and jobs we used for this test. We have a workaround, which restarts the Flink nodes once a memory threshold is reached, but this has lowered the availability of our services.

Thanks for your help.

On 30 August 2017 at 10:39, Robert Metzger <rmetzger@apache.org> wrote:
I just saw that your other email is about the same issue.

Since you've done a heap dump already, did you see any pattern in the allocated objects? Ideally none of the classes from your user code should stick around when no job is running.
What's the size of the heap dump? I'm happy to take a look at it if it's reasonably small.

On Wed, Aug 30, 2017 at 10:27 AM, Robert Metzger <rmetzger@apache.org> wrote:
Hi Javier,

I'm not aware of such issues with Flink, but if you could give us some more details on your setup, I might get some more ideas on what to look for.

Are you using the RocksDBStateBackend? (RocksDB does some JNI allocations that could potentially leak memory.)
Also, are you passing any special garbage collector options? (Maybe some classes are not being unloaded.)
Are you using anything else special (such as Protobuf or Avro formats, or any other big library)?
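
A quick way to check the class-unloading point is the standard ClassLoadingMXBean; a minimal sketch (run inside the TaskManager JVM, or read the same attributes remotely over JMX, e.g. with jconsole):

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassUnloadingCheck {
    public static void main(String[] args) {
        ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
        // If "unloaded" stays near zero while "totalLoaded" keeps climbing as jobs
        // are submitted, class metadata (and therefore Metaspace) is accumulating
        // instead of being garbage collected.
        System.out.printf("loaded=%d totalLoaded=%d unloaded=%d%n",
                classLoading.getLoadedClassCount(),
                classLoading.getTotalLoadedClassCount(),
                classLoading.getUnloadedClassCount());
    }
}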

Regards,
Robert

On Mon, Aug 28, 2017 at 5:04 PM, Javier Lopez <javier.lopez@zalando.de> wrote:
Hi all,

We are starting a lot of Flink jobs (streaming), and after we have started 200 or more jobs we see that the non-heap memory in the TaskManagers increases a lot, to the point of killing the instances. We found out that every time we start a new job, the committed non-heap memory increases by 5 to 10MB. Is this expected behavior? Are there ways to prevent this?
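
(For reference, the committed non-heap figure mentioned here is the one exposed by java.lang.management; a rough sketch for logging it periodically while jobs are being submitted:)

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class NonHeapWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
            // In the scenario described above, the committed value grows by roughly
            // 5-10MB per submitted job and does not shrink afterwards.
            System.out.printf("non-heap used=%dMB committed=%dMB%n",
                    nonHeap.getUsed() >> 20, nonHeap.getCommitted() >> 20);
            Thread.sleep(10_000L);
        }
    }
}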

--
Flavio Pompermaier
Development Department

OKKAM S.r.l.
Tel. +(39) 0461 041809
