Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
MIME-Version: 1.0
In-Reply-To: <EEA27C26-704C-4984-BFD7-C54358905EF6@piksel.com>
References: <FB830A89-24C4-47A0-8F64-148C01D80996@piksel.com>
 <CAC27z=NoqWF7UrAOastTnM+=S=r68gKsttTkaLG4Zft43YKMWQ@mail.gmail.com> <EEA27C26-704C-4984-BFD7-C54358905EF6@piksel.com>
From: Stephan Ewen <sewen@apache.org>
Date: Mon, 31 Jul 2017 22:21:27 +0200
Message-ID: <CANC1h_tfuufaaq6-_AK4Z2Nsn4xbr_-2GDdQibwiTAgfE180Zg@mail.gmail.com>
Subject: Re: Flink CLI cannot submit job to Flink on Mesos
To: Francisco Gonzalez Barea <Francisco.Gonzalez@piksel.com>
Cc: Till Rohrmann <trohrmann@apache.org>, "user@flink.apache.org" <user@flink.apache.org>
Content-Type: multipart/alternative; boundary="001a1146dd347ae1480555a2c7f9"
archived-at: Mon, 31 Jul 2017 20:21:46 -0000

--001a1146dd347ae1480555a2c7f9
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Francisco!

Can you drop the explicit JobManager address (the -m option) from the
command? The client should pick up that address automatically from ZooKeeper
as well (together with the HA leader session ID).

Please check that the ZooKeeper HA config entries are present in the config
used by the CLI.
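
As a rough sketch of that check (not an official tool): the hypothetical
shell function below, missing_ha_keys, prints the HA keys that have no
uncommented entry in a given flink-conf.yaml. The two key names are the ones
I would expect from the Flink 1.3 HA docs; treat the list, and the function
itself, as assumptions to adapt to your setup.

```shell
# Hypothetical helper (an illustration, not part of Flink): print every
# ZooKeeper HA key that has no uncommented entry in the given config file.
# The key list is an assumption based on the Flink 1.3 HA documentation.
missing_ha_keys() {
    conf_file="$1"
    for key in high-availability high-availability.zookeeper.quorum; do
        # A key counts as present when a line starts with "key:",
        # optionally preceded by whitespace. Commented lines do not match.
        if ! grep -q "^[[:space:]]*${key}:" "$conf_file"; then
            echo "$key"
        fi
    done
}
```

If this prints nothing for the flink-conf.yaml that the CLI reads (for
example missing_ha_keys "$FLINK_CONF_DIR/flink-conf.yaml", assuming
FLINK_CONF_DIR points at the client's conf directory), the submission can be
retried without -m: ./flink run -d ${our-program-jar} ${our-program-params}.
Other HA entries that are set on the cluster, such as the ZooKeeper root
path, presumably need to match on the client as well.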

Stephan


On Mon, Jul 31, 2017 at 6:27 PM, Francisco Gonzalez Barea <
Francisco.Gonzalez@piksel.com> wrote:

> Hi again,
>
> On the other hand, we are running the following flink CLI command:
>
> ./flink run -d -m ${jobmanager.rpc.address}:${jobmanager.rpc.port}
>  ${our-program-jar} ${our-program-params}
>
> Maybe we are using the command incorrectly?
>
> Thank you
>
> On 28 Jul 2017, at 11:07, Till Rohrmann <trohrmann@apache.org> wrote:
>
> Hi Francisco,
>
> have you set the right high-availability configuration options in your
> client configuration as described here [1]? If not, then Flink is not able
> to find the correct JobManager because it retrieves the address as well as
> a fencing token (called leader session id) from the HA store (ZooKeeper).
>
> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/mesos.html#high-availability
>
> Cheers,
> Till
>
> On Thu, Jul 27, 2017 at 6:20 PM, Francisco Gonzalez Barea <
> Francisco.Gonzalez@piksel.com> wrote:
>
>> Hello,
>>
>> We're having a lot of issues while trying to submit a job remotely using
>> the Flink CLI command line tool. We have tried different configurations, but
>> in all of them we get errors from Akka while trying to connect. I will try
>> to summarise the configurations we've tried.
>>
>> - Flink 1.3.0 deployed within a Docker container on a Mesos cluster
>> (using Marathon)
>> - This Flink instance has the property jobmanager.rpc.address set to a
>> hostname (i.e. something like ip-XXXXXXXXX.eu.west-1.compute.internal)
>> - The same Flink version is used for the remote client (e.g. on my laptop).
>>
>> When I try to submit the job using the command flink run -m
>> myHostName:myPort (the same values as in jobmanager.rpc.address and
>> jobmanager.rpc.port), after some time waiting I get the trace at the end of
>> this email. On the Flink side we get this error from Akka:
>>
>> Association with remote system [akka.tcp://flink@10.203.23.24:24469] has
>> failed, address is now gated for [5000] ms. Reason: [Association failed
>> with [akka.tcp://flink@10.203.23.24:24469]] Caused by: [Connection
>> refused: /10.203.23.24:24469]
>>
>> After reading a bit, it seems there are some problems related to Akka
>> resolving hostnames to IPs, so we decided to start up the same Flink but
>> with jobmanager.rpc.address changed to the direct IP (i.e. something like
>> XX.XXX.XX.XX). In this case I'm getting the same trace (at the end of the
>> email) on the client side and this one from the Flink server:
>>
>> Discard message LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId: b25d5c5ced962632abc5ee9ef867792e),DETACHED))
>> because the expected leader session ID b4f53899-5d70-467e-8e9d-e56eeb60b6e3
>> did not equal the received leader session ID
>> 00000000-0000-0000-0000-000000000000.
>>
>> We have tried some other things but without success... Any clue that could
>> help us?
>>
>> Thanks in advance!
>>
>> org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 milliseconds
>> at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:454)
>> at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:99)
>> at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
>> at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:76)
>> at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:345)
>> at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:831)
>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256)
>> at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1073)
>> at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120)
>> at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117)
>> at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
>> at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
>> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1116)
>> Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 milliseconds
>> at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:426)
>> at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:451)
>> ... 15 more
>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
>> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
>> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
>> at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>> at scala.concurrent.Await$.result(package.scala:190)
>> at scala.concurrent.Await.result(package.scala)
>> at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:423)
>> ... 16 more
>
>
>

--001a1146dd347ae1480555a2c7f9--