Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAO2XNQ98KTx4zkA1SFq9X7fmB9u6-zby8GCL9RPbdbaNc4bJQA@mail.gmail.com>
References: 
 <CAO2XNQ98KTx4zkA1SFq9X7fmB9u6-zby8GCL9RPbdbaNc4bJQA@mail.gmail.com>
Date: Wed, 30 Sep 2015 17:09:26 +0200
Message-ID: 
 <CAO2XNQ9s9Yy+mtYTp-QZZxwZ+TUKxc0ri_L7DdJY6Uhh+4FmLQ@mail.gmail.com>
Subject: Re: All but one TMs connect when JM has more than 16G of memory
From: Robert Schmidtke <ro.schmidtke@gmail.com>
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=001a1141015c03ceda0520f851fc

--001a1141015c03ceda0520f851fc
Content-Type: text/plain; charset=UTF-8

I should say I'm running the current Flink master branch.

On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke <ro.schmidtke@gmail.com>
wrote:

> It's me again. This is a strange issue; I hope I managed to find the right
> keywords. I have 8 machines: 1 for the JM and 7 TMs with 64G of memory
> each.
>
> When running my job like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....
>
> The job completes without any problems. When running it like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....
>
> (note the one extra MB of memory for the JM), the execution stalls,
> continuously reporting:
>
> .....
> TaskManager status (6/7)
> TaskManager status (6/7)
> TaskManager status (6/7)
> .....
>
> I did some poking around, but I couldn't find anything in the code that
> would explain this.
>
> The JM log says:
>
> .....
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$
>        -  JVM Options:
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$
>        -     -Xmx12289M
> .....
>
> but then continues to report
>
> .....
> 16:52:59,311 INFO
>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user
> requested 7 containers, 6 running. 1 containers missing
> 16:52:59,831 INFO
>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user
> requested 7 containers, 6 running. 1 containers missing
> 16:53:00,351 INFO
>  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user
> requested 7 containers, 6 running. 1 containers missing
> .....
>
> forever until I cancel the job.
>
> If you have any ideas I'm happy to try them out. Thanks in advance for any
> hints! Cheers.
>
> Robert
> --
> My GPG Key ID: 336E2680
>

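For reference, here is a small sketch of the arithmetic that may be relevant
to the two commands quoted above. It rests on assumptions that are not
confirmed anywhere in this thread: that the JVM heap is derived from the
container size by subtracting a 25% cutoff (which is at least consistent with
the -Xmx12289M line in the JM log), and that YARN rounds each container
request up to a multiple of a scheduler increment, taken here to be 1024 MB.
The function names and the 384 MB minimum cutoff are made up for illustration.

```python
def jvm_heap_mb(container_mb: int,
                cutoff_ratio: float = 0.25,   # assumed cutoff ratio
                min_cutoff_mb: int = 384) -> int:  # assumed minimum cutoff
    """Heap left after reserving a cutoff for non-heap memory (assumed rule)."""
    cutoff = max(int(container_mb * cutoff_ratio), min_cutoff_mb)
    return container_mb - cutoff


def yarn_rounded_mb(requested_mb: int, increment_mb: int = 1024) -> int:
    """Round a request up to the next multiple of the assumed increment."""
    return (requested_mb + increment_mb - 1) // increment_mb * increment_mb


if __name__ == "__main__":
    # The two -yjm values from the commands above.
    for jm_mb in (16384, 16385):
        print(jm_mb, jvm_heap_mb(jm_mb), yarn_rounded_mb(jm_mb))
```

Under these assumptions, -yjm 16384 is already a multiple of 1024 and would be
requested as-is, while -yjm 16385 would be rounded up to 17408 MB. Whether
that extra gigabyte is what keeps the seventh container from being allocated
depends on the NodeManager memory limits, which are not shown here.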

-- 
My GPG Key ID: 336E2680

--001a1141015c03ceda0520f851fc--