Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
MIME-Version: 1.0
In-Reply-To: <4385FC73-9684-46EA-B36F-18691350C009@tngtech.com>
References: <B0237754-2851-4FA1-8866-3FB8D783ED4B@tngtech.com>
	<CAKiyyaHYmojSPXxCxucPOHRLzTyUxO-fL5XnOz44gN5F8HWQhw@mail.gmail.com>
	<CAKiyyaG7dHqLs_PUGMoWkXo+m7rsoab2R7Yew5au4UGmWC1uEw@mail.gmail.com>
	<CAC27z=PxTtZZxiahKH391DVRbAmRf6vz-QhiEDN75udfVx+Z2w@mail.gmail.com>
	<CANC1h_sX5GHBTttLt86DmN1RG6J_67Tdrpoj8z=gXKwGLBZQxA@mail.gmail.com>
	<CAC27z=Ps0tBFV6iK0wTQ3c-QTrDEXBdpbTw2ASc4pACFGbbeKw@mail.gmail.com>
	<CANC1h_vEadk64dhhBa_0SK6fUC_zG_j-v96=5tZYjf-22aFOxw@mail.gmail.com>
	<CAC27z=Mb-j8rKnqXAcqmjoGJdiq52X5f7BoboTUp5=BeRmtOhg@mail.gmail.com>
	<4385FC73-9684-46EA-B36F-18691350C009@tngtech.com>
Date: Thu, 3 Mar 2016 14:32:20 +0100
Message-ID: 
 <CAC27z=M-64iORsar_EBhs7OJGiv6+C=K0QrkoEOQwTwv+F4FYg@mail.gmail.com>
Subject: Re: YARN JobManager HA using wrong network interface
From: Till Rohrmann <trohrmann@apache.org>
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=001a114a91b2218462052d2507db

--001a114a91b2218462052d2507db
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

I've created an issue [1] and opened a PR [2] to fix the issue.

[1] https://issues.apache.org/jira/browse/FLINK-3570
[2] https://github.com/apache/flink/pull/1758
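
For readers following the thread, here is a minimal Java sketch of the two
fallback strategies discussed below. It is not the code from the PR. The
class and method names (NicSelectionSketch, isCandidate, firstCandidate,
preferLocalHost, ip) are made up for illustration, and skipping a loopback
local-host address is an assumption of this sketch. The addresses are the
ones from Max's logs.

```java
import java.net.Inet4Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; names are hypothetical and not Flink's API.
final class NicSelectionSketch {

    // Condition described in the thread: IPv4, not link-local, not loopback.
    static boolean isCandidate(InetAddress address) {
        return address instanceof Inet4Address
                && !address.isLinkLocalAddress()
                && !address.isLoopbackAddress();
    }

    // Heuristic as described: the first qualifying address in iteration order.
    static Optional<InetAddress> firstCandidate(List<InetAddress> addresses) {
        return addresses.stream().filter(NicSelectionSketch::isCandidate).findFirst();
    }

    // Proposed fallback: prefer the address bound to the local host name.
    // In a real program localHost would come from InetAddress.getLocalHost().
    // Assumption of this sketch: a loopback local host is skipped.
    static InetAddress preferLocalHost(InetAddress localHost, List<InetAddress> addresses) {
        if (localHost != null && !localHost.isLoopbackAddress()) {
            return localHost;
        }
        return firstCandidate(addresses)
                .orElseThrow(() -> new IllegalStateException("no usable address"));
    }

    // Builds an IPv4 address from literal octets without any DNS lookup.
    static InetAddress ip(int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Iteration order in which eth1 happens to come before eth0.
        List<InetAddress> order =
                Arrays.asList(ip(127, 0, 0, 1), ip(10, 120, 193, 110), ip(10, 127, 68, 136));
        InetAddress localHost = ip(10, 127, 68, 136); // eth0, bound to the primary hostname

        System.out.println("heuristic:  " + firstCandidate(order).get().getHostAddress());
        System.out.println("local host: " + preferLocalHost(localHost, order).getHostAddress());
    }
}
```

With this iteration order, the first-candidate heuristic picks 10.120.193.110
(eth1), while preferring the local host picks 10.127.68.136 (eth0).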

Cheers,
Till


On Thu, Mar 3, 2016 at 12:33 PM, Maximilian Bode <
maximilian.bode@tngtech.com> wrote:

> Hi Ufuk, Till and Stephan,
>
> Yes, that is what we observed. The primary hostname, i.e. the one returned
> by the unix hostname command, is in fact bound to the eth0 interface,
> whereas Flink uses the eth1 interface (pertaining to another hostname).
>
> Changing akka.lookup.timeout to 100 s seems to fix the problem, as the new
> job manager is now available in sufficient time. I would still agree with
> Stephan that the local hostname should be the preferred strategy.
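
To illustrate why the larger timeout helps, here is a small Java sketch of a
lookup that retries until a deadline and only then gives up (the point at
which the TaskManager would fall back to the heuristic). This is not Flink's
implementation; lookupWithTimeout, availableAfter, the address string and the
timings are hypothetical.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Illustrative sketch only; names are hypothetical and not Flink's API.
final class LookupTimeoutSketch {

    // Retries the attempt until it succeeds or the timeout has elapsed.
    // An empty result means the caller has to fall back to a heuristic.
    static <T> Optional<T> lookupWithTimeout(
            Supplier<Optional<T>> attempt, long timeoutMillis, long pauseMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        do {
            Optional<T> result = attempt.get();
            if (result.isPresent()) {
                return result;
            }
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        } while (System.nanoTime() - deadline < 0);
        return Optional.empty();
    }

    // Simulates a leader address that only becomes visible after some failed
    // attempts, e.g. because the new JobManager has not published it yet.
    static Supplier<Optional<String>> availableAfter(int failures, String address) {
        AtomicInteger calls = new AtomicInteger();
        return () -> calls.incrementAndGet() > failures
                ? Optional.of(address)
                : Optional.empty();
    }

    public static void main(String[] args) {
        // Short timeout: gives up before the new leader address is available.
        System.out.println(lookupWithTimeout(availableAfter(2, "new-jm:6123"), 1, 50));
        // Longer timeout: keeps retrying until the address shows up.
        System.out.println(lookupWithTimeout(availableAfter(2, "new-jm:6123"), 2000, 10));
    }
}
```

The trade-off of a larger timeout is that a TaskManager which genuinely
cannot reach any JobManager waits longer before it falls back.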
>
> Cheers,
>  Max
> --
> Maximilian Bode * Junior Consultant * maximilian.bode@tngtech.com
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Managing directors: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke
> Registered office: Unterföhring * Amtsgericht München * HRB 135082
>
> On 03.03.2016 at 12:29, Till Rohrmann <till.rohrmann@gmail.com> wrote:
>
> No, I don't think this behaviour was introduced by HA. It has been the
> default behaviour for a long time. If you think we should still change
> it, then I can open an issue for it.
>
> On Thu, Mar 3, 2016 at 12:20 PM, Stephan Ewen <sewen@apache.org> wrote:
>
>> Okay, that is a change from the original behavior, introduced in HA.
>> Originally, if the connection attempts failed, it always returned the
>> InetAddress.getLocalHost() interface.
>> I think we should change it back to that, because that interface is by
>> far the best possible heuristic.
>>
>> On Thu, Mar 3, 2016 at 11:39 AM, Till Rohrmann <trohrmann@apache.org>
>> wrote:
>>
>>> If I'm not mistaken, then it's not necessarily true that the heuristic
>>> returns InetAddress.getLocalHost() in all cases. The heuristic will
>>> select the first network interface with the aforementioned conditions,
>>> but before returning it, it will try one last time to connect to the JM
>>> via the interface bound to InetAddress.getLocalHost(). However, if this
>>> fails, then the heuristically selected network interface will be returned.
>>>
>>> On Thu, Mar 3, 2016 at 10:49 AM, Stephan Ewen <sewen@apache.org> wrote:
>>>
>>>> If the TaskManager cannot connect to the JobManager, it will use the
>>>> interface that is bound to the machine's host name
>>>> ("InetAddress.getLocalHost()").
>>>>
>>>> So, the best way to fix this would be to make sure that all machines
>>>> have a proper network configuration. Then Flink would either use an
>>>> address that can connect (via trying various interfaces), or it would
>>>> fall back to the hostname/interface that is configured on the machine.
>>>>
>>>>
>>>> On Thu, Mar 3, 2016 at 10:43 AM, Till Rohrmann <trohrmann@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Max,
>>>>>
>>>>> the problem is that before starting the TM, we have to find the
>>>>> network interface which is reachable by the other machines. So what
>>>>> we do is connect to the current JobManager. If it should happen, as in
>>>>> your case, that the JobManager just died and the new JM address has not
>>>>> yet been written to ZooKeeper, then the TMs don't have much choice other
>>>>> than using the heuristic.
>>>>>
>>>>> I can't really tell why eth1 is chosen over eth0. The condition is
>>>>> that the interface address is an Inet4Address, is not a link-local
>>>>> address, and is not a loopback address.
>>>>>
>>>>> Thus, Ufuk's solution of increasing akka.lookup.timeout seems to be
>>>>> the easiest way to solve your problem. I've checked: the default value
>>>>> is set to 10 s, which might be a bit too low for starting a new JM
>>>>> and publishing its address via ZooKeeper.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Thu, Mar 3, 2016 at 10:28 AM, Ufuk Celebi <uce@apache.org> wrote:
>>>>>
>>>>>> I had an offline chat with Till about this. He pointed out that the
>>>>>> address is chosen once at start-up time (while not being able to
>>>>>> connect to the old job manager) and then it stays fixed at eth1.
>>>>>>
>>>>>> You can increase the lookup timeout by setting akka.lookup.timeout to
>>>>>> a higher value (like 100 s). This is the only workaround I'm aware of
>>>>>> at this point. Maybe Till can chime in here on whether this has other
>>>>>> implications as well?
>>>>>>
>>>>>> – Ufuk
>>>>>>
>>>>>> On Thu, Mar 3, 2016 at 9:59 AM, Ufuk Celebi <uce@apache.org> wrote:
>>>>>> > Hey Max!
>>>>>> >
>>>>>> > for the first WARN in
>>>>>> > org.apache.flink.runtime.webmonitor.JobManagerRetriever: this is
>>>>>> > expected if the new leader has not updated ZooKeeper yet. The
>>>>>> > important thing is that the new leading job manager is eventually
>>>>>> > retrieved. This did happen, right?
>>>>>> >
>>>>>> > Regarding eth1 vs. eth0: After the new job manager becomes leader, the
>>>>>> > task manager should re-try connecting to it with the same strategy as
>>>>>> > in the initial connection establishment (e.g. try SLOW first and only
>>>>>> > fall back to HEURISTIC). Can you see in the logs whether this happens?
>>>>>> >
>>>>>> > The best thing would be to share the complete logs. Is this possible?
>>>>>> > If not publicly, feel free to send it to me privately (uce at apache
>>>>>> > org).
>>>>>> >
>>>>>> > – Ufuk
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Mar 3, 2016 at 9:21 AM, Maximilian Bode
>>>>>> > <maximilian.bode@tngtech.com> wrote:
>>>>>> >> Hi everyone,
>>>>>> >>
>>>>>> >> we are trying to get JobManager HA to work in the context of a per-job YARN
>>>>>> >> session using the 1.0.0-rc3 from a few days ago and are having a problem
>>>>>> >> concerning task managers with several network interfaces.
>>>>>> >>
>>>>>> >> After manually killing the job manager process, the jobmanager.log on the
>>>>>> >> newly allocated second job manager reads:
>>>>>> >> ---
>>>>>> >> 2016-03-02 18:01:09,635 WARN  Remoting
>>>>>> >> - Tried to associate with unreachable remote address
>>>>>> >> [akka.tcp://flink@10.127.68.136:34811]. Address is now gated for 5000 ms,
>>>>>> >> all messages to this address will be delivered to dead letters. Reason:
>>>>>> >> Connection refused: /10.127.68.136:34811
>>>>>> >> 2016-03-02 18:01:09,644 WARN
>>>>>> >> org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to
>>>>>> >> retrieve leader gateway and port.
>>>>>> >> akka.actor.ActorNotFound: Actor not found for:
>>>>>> >> ActorSelection[Anchor(akka.tcp://flink@10.127.68.136:34811/),
>>>>>> >> Path(/user/jobmanager)]
>>>>>> >> at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
>>>>>> >> at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
>>>>>> >> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>>>>> >> at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
>>>>>> >> at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
>>>>>> >> at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>>>>>> >> at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>>>>>> >> at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>>>>>> >> at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
>>>>>> >> at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
>>>>>> >> at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
>>>>>> >> at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
>>>>>> >> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>>>>>> >> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>>>>>> >> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
>>>>>> >> at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
>>>>>> >> at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
>>>>>> >> at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
>>>>>> >> at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
>>>>>> >> at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
>>>>>> >> at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
>>>>>> >> at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
>>>>>> >> at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>>>>>> >> at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>>>>>> >> at akka.actor.ActorCell.terminate(ActorCell.scala:369)
>>>>>> >> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
>>>>>> >> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>>>>>> >> at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
>>>>>> >> at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>>>>>> >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>>>>> >> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>> >> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>> >> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>> >> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>> >> ---
>>>>>> >> where the IP not found is from the old job manager. So far, is this the
>>>>>> >> expected behavior?
>>>>>> >>
>>>>>> >> The problem then arises on a new task manager, which also tries to connect
>>>>>> >> to the old job manager unsuccessfully. The ZooKeeperLeaderRetrievalService
>>>>>> >> starts cycling through the available network interfaces, as can be seen in
>>>>>> >> the relevant taskmanager.log:
>>>>>> >> ---
>>>>>> >> 2016-03-02 18:01:13,636 INFO
>>>>>> >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>>>> >> Starting ZooKeeperLeaderRetrievalService.
>>>>>> >> 2016-03-02 18:01:13,646 INFO
>>>>>> >> org.apache.flink.runtime.util.LeaderRetrievalUtils            - Trying to
>>>>>> >> select the network interface and address to use by connecting to the leading
>>>>>> >> JobManager.
>>>>>> >> 2016-03-02 18:01:13,646 INFO
>>>>>> >> org.apache.flink.runtime.util.LeaderRetrievalUtils            - TaskManager
>>>>>> >> will try to connect for 10000 milliseconds before falling back to heuristics
>>>>>> >> 2016-03-02 18:01:13,712 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Retrieved new target address /10.127.68.136:34811.
>>>>>> >> 2016-03-02 18:01:14,079 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Trying to connect to address /10.127.68.136:34811
>>>>>> >> 2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address
>>>>>> >> 'task.manager.eth0.hostname.com/10.127.68.136': Connection refused
>>>>>> >> 2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address '/10.127.68.136': Connection refused
>>>>>> >> 2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address '/10.120.193.110': Connection refused
>>>>>> >> 2016-03-02 18:01:14,082 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address '/10.127.68.136': Connection refused
>>>>>> >> 2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address '/127.0.0.1': Connection refused
>>>>>> >> 2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address '/10.120.193.110': Connection refused
>>>>>> >> 2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address '/10.127.68.136': Connection refused
>>>>>> >> 2016-03-02 18:01:14,083 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>> >> - Failed to connect from address '/127.0.0.1': Connection refused
>>>>>> >> ---
>>>>>> >> After five repetitions, the task manager stops trying to retrieve the leader
>>>>>> >> and, using the HEURISTIC strategy, ends up using eth1 (10.120.193.110) from
>>>>>> >> now on:
>>>>>> >> ---
>>>>>> >> 2016-03-02 18:01:23,650 INFO
>>>>>> >> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>>>>>> >> Stopping ZooKeeperLeaderRetrievalService.
>>>>>> >> 2016-03-02 18:01:23,655 INFO  org.apache.zookeeper.ClientCnxn
>>>>>> >> - EventThread shut down
>>>>>> >> 2016-03-02 18:01:23,655 INFO  org.apache.zookeeper.ZooKeeper
>>>>>> >> - Session: 0x25229757cff035b closed
>>>>>> >> 2016-03-02 18:01:23,664 INFO
>>>>>> >> org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager
>>>>>> >> will use hostname/address 'task.manager.eth1.hostname.com' (10.120.193.110)
>>>>>> >> for communication.
>>>>>> >> ---
>>>>>> >> Subsequently, the new jobmanager is discovered and the taskmanager is able to
>>>>>> >> register at the jobmanager using eth1. The problem is that connections TO
>>>>>> >> eth1 are not possible, so Flink should always use eth0. The exception we
>>>>>> >> later see is:
>>>>>> >> ---
>>>>>> >> java.io.IOException: Connecting the channel failed: Connecting to remote
>>>>>> >> task manager + 'other.task.manager.eth1.hostname/10.120.193.111:46620' has
>>>>>> >> failed. This might indicate that the remote task manager has been lost.
>>>>>> >> at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
>>>>>> >> at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131)
>>>>>> >> at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
>>>>>> >> at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
>>>>>> >> at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:115)
>>>>>> >> at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:388)
>>>>>> >> at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:411)
>>>>>> >> at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:108)
>>>>>> >> at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:175)
>>>>>> >> at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:65)
>>>>>> >> at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:224)
>>>>>> >> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
>>>>>> >> at java.lang.Thread.run(Thread.java:744)
>>>>>> >> ---
>>>>>> >> The root cause seems to be that network interface selection is still using
>>>>>> >> the old jobmanager location and hence is not able to choose the right
>>>>>> >> interface. In particular, it seems that iteration order over the network
>>>>>> >> interfaces differs between the HEURISTIC and SLOW strategy, which then leads
>>>>>> >> to the wrong interface being selected.
>>>>>> >>
>>>>>> >> Cheers,
>>>>>> >>  Max
>>>>>> >>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
>

--001a114a91b2218462052d2507db
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">I&#39;ve created an issue [1] and opened a PR [2] to fix t=
he issue.<div><br></div><div>[1] <a href=3D"https://issues.apache.org/jira/=
browse/FLINK-3570">https://issues.apache.org/jira/browse/FLINK-3570</a></di=
v><div>[2]=C2=A0<a href=3D"https://github.com/apache/flink/pull/1758">https=
://github.com/apache/flink/pull/1758</a></div><div><br></div><div>Cheers,</=
div><div>Till<br><div><br></div></div></div><div class=3D"gmail_extra"><br>=
<div class=3D"gmail_quote">On Thu, Mar 3, 2016 at 12:33 PM, Maximilian Bode=
 <span dir=3D"ltr">&lt;<a href=3D"mailto:maximilian.bode@tngtech.com" targe=
t=3D"_blank">maximilian.bode@tngtech.com</a>&gt;</span> wrote:<br><blockquo=
te class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc so=
lid;padding-left:1ex"><div style=3D"word-wrap:break-word"><div>Hi Ufuk, Til=
l and Stephan,</div><div><br></div>Yes, that is what we observed. The prima=
ry hostname, i.e. the one returned by the unix hostname command, is in fact=
 bound to the eth0 interface, whereas Flink uses the eth1 interface (pertai=
ning to another hostname).<div><br></div><div>Changing akka.lookup.timeout =
to 100 s seems to fix the problem as now the new job manager is available i=
n sufficient time. I still would agree with Stephan on taking the local hos=
tname being the preferred strategy.</div><span class=3D""><div><br></div><d=
iv>Cheers,</div><div>=C2=A0Max</div><div>
<div style=3D"color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-=
indent:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wra=
p:break-word"><div style=3D"color:rgb(0,0,0);letter-spacing:normal;text-ali=
gn:start;text-indent:0px;text-transform:none;white-space:normal;word-spacin=
g:0px;word-wrap:break-word"><div style=3D"color:rgb(0,0,0);letter-spacing:n=
ormal;text-align:start;text-indent:0px;text-transform:none;white-space:norm=
al;word-spacing:0px;word-wrap:break-word"><div style=3D"color:rgb(0,0,0);le=
tter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;wh=
ite-space:normal;word-spacing:0px;word-wrap:break-word"><div style=3D"color=
:rgb(0,0,0);letter-spacing:normal;text-align:start;text-indent:0px;text-tra=
nsform:none;white-space:normal;word-spacing:0px;word-wrap:break-word"><div =
style=3D"color:rgb(0,0,0);letter-spacing:normal;text-align:start;text-inden=
t:0px;text-transform:none;white-space:normal;word-spacing:0px;word-wrap:bre=
ak-word"><div style=3D"color:rgb(0,0,0);letter-spacing:normal;text-align:st=
art;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px=
;word-wrap:break-word">=E2=80=94=C2=A0</div><div style=3D"color:rgb(0,0,0);=
letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;=
white-space:normal;word-spacing:0px;word-wrap:break-word">Maximilian Bode *=
 Junior Consultant * <a href=3D"mailto:maximilian.bode@tngtech.com" target=
=3D"_blank">maximilian.bode@tngtech.com</a><div>TNG Technology Consulting G=
mbH, Betastr. 13a, 85774 Unterf=C3=B6hring</div><div>Gesch=C3=A4ftsf=C3=BCh=
rer: Henrik Klagges, Christoph Stock, Dr. Robert Dahlke</div><div>Sitz: Unt=
erf=C3=B6hring * Amtsgericht M=C3=BCnchen * HRB 135082</div></div></div></d=
iv></div></div></div></div>
</div>
<br></span><div><div class=3D"h5"><div><blockquote type=3D"cite"><div>Am 03=
.03.2016 um 12:29 schrieb Till Rohrmann &lt;<a href=3D"mailto:till.rohrmann=
@gmail.com" target=3D"_blank">till.rohrmann@gmail.com</a>&gt;:</div><br><di=
v><div dir=3D"ltr">No I don&#39;t think this behaviour has been introduced =
by HA. That is the default behaviour we used for a long time. If you think =
we should still change it, then I can open an issue for it.</div><div class=
=3D"gmail_extra"><br><div class=3D"gmail_quote">On Thu, Mar 3, 2016 at 12:2=
0 PM, Stephan Ewen <span dir=3D"ltr">&lt;<a href=3D"mailto:sewen@apache.org=
" target=3D"_blank">sewen@apache.org</a>&gt;</span> wrote:<br><blockquote c=
lass=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;=
padding-left:1ex"><div dir=3D"ltr">Okay, that is a change from the original=
 behavior, introduced in HA. Originally, if the connection attempts failed,=
 it always returned the=C2=A0<span style=3D"font-family:Consolas,Inconsolat=
a,Courier,monospace;font-size:10.88px;white-space:pre-wrap;background-color=
:rgb(248,248,248)">InetAddress.getLocalHost() </span>interface.<div>I think=
 we should change it back to that, because that interface is by far the bes=
t possible heuristic.</div></div><div><div><div class=3D"gmail_extra"><br><=
div class=3D"gmail_quote">On Thu, Mar 3, 2016 at 11:39 AM, Till Rohrmann <s=
pan dir=3D"ltr">&lt;<a href=3D"mailto:trohrmann@apache.org" target=3D"_blan=
k">trohrmann@apache.org</a>&gt;</span> wrote:<br><blockquote class=3D"gmail=
_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:=
1ex"><div dir=3D"ltr"><div><p style=3D"margin:0px 0px 1.2em!important">If I=
=E2=80=99m not mistaken, then it=E2=80=99s not necessarily true that the he=
uristic returns <code style=3D"font-size:0.85em;font-family:Consolas,Incons=
olata,Courier,monospace;margin:0px 0.15em;padding:0px 0.3em;white-space:pre=
-wrap;border:1px solid rgb(234,234,234);border-radius:3px;display:inline;ba=
ckground-color:rgb(248,248,248)">InetAddress.getLocalHost()</code> in all c=
ases. The heuristic will select the first network interface with the afore-=
mentioned conditions but before returning it, it will try a last time to co=
nnect to the JM via the interface bound to <code style=3D"font-size:0.85em;=
font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;paddin=
g:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);border-r=
adius:3px;display:inline;background-color:rgb(248,248,248)">InetAddress.get=
LocalHost()</code>. However, if this fails, then the heuristically selected=
 network interface will be returned.</p>
</div></div><div><div><div class=3D"gmail_extra"><br><d=
iv class=3D"gmail_quote">On Thu, Mar 3, 2016 at 10:49 AM, Stephan Ewen <spa=
n dir=3D"ltr">&lt;<a href=3D"mailto:sewen@apache.org" target=3D"_blank">sew=
en@apache.org</a>&gt;</span> wrote:<br><blockquote class=3D"gmail_quote" st=
yle=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div =
dir=3D"ltr">If the TaskManager cannot connect to the JobManager, it will us=
e the interface that is bound to the machine&#39;s host name (&quot;InetAdd=
ress.getLocalHost()&quot;).<div><br></div><div>So, the best way to fix this=
 would be to make sure that all machines have a proper network configuratio=
n. Then Flink would either use an address that can connect (via trying vari=
ous interfaces), or it would default back to the hostname/interface that is=
 configured on the machine.</div><div><br></div></div><div><div><div class=
=3D"gmail_extra"><br><div class=3D"gmail_quote">On Thu, Mar 3, 2016 at 10:4=
3 AM, Till Rohrmann <span dir=3D"ltr">&lt;<a href=3D"mailto:trohrmann@apach=
e.org" target=3D"_blank">trohrmann@apache.org</a>&gt;</span> wrote:<br><blo=
ckquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8ex;border-left:1px #c=
cc solid;padding-left:1ex"><div dir=3D"ltr"><div><p style=3D"margin:0px 0px=
 1.2em!important">Hi Max,</p><p style=3D"margin:0px 0px 1.2em!important">th=
e problem is that before starting the TM, we have to find the network inter=
face which is reachable by the other machines. So what we do is to connect =
to the current JobManager. If it should happen, as in your case, that the J=
obManager just died and the new JM address has not been written to ZooKeepe=
r, then the TMs don=E2=80=99t have much choice other than using the heurist=
ic.</p><p style=3D"margin:0px 0px 1.2em!important">I can=E2=80=99t really t=
ell why eth1 is chosen over eth0. The condition is that the interface addre=
ss is an <code style=3D"font-size:0.85em;font-family:Consolas,Inconsolata,C=
ourier,monospace;margin:0px 0.15em;padding:0px 0.3em;white-space:pre-wrap;b=
order:1px solid rgb(234,234,234);border-radius:3px;display:inline;backgroun=
d-color:rgb(248,248,248)">Inet4Address</code>, no link local address as wel=
l as not a loopback address.</p><p style=3D"margin:0px 0px 1.2em!important"=
>Thus, Ufuk=E2=80=99s solution, to increase <code style=3D"font-size:0.85em=
;font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;paddi=
ng:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);border-=
radius:3px;display:inline;background-color:rgb(248,248,248)">akka.lookup.ti=
meout</code> seems to be the easiest solution to solve your problem. I=E2=
=80=99ve checked the default value is set to <code style=3D"font-size:0.85e=
m;font-family:Consolas,Inconsolata,Courier,monospace;margin:0px 0.15em;padd=
ing:0px 0.3em;white-space:pre-wrap;border:1px solid rgb(234,234,234);border=
-radius:3px;display:inline;background-color:rgb(248,248,248)">10 s</code> w=
hich might be a bit too low for restarting a new JM and publishing its addr=
ess via ZooKeeper.</p><p style=3D"margin:0px 0px 1.2em!important">Cheers,<b=
r>Till</p>
</div></div><div><div><div class=3D"gmail_extra"><br><div class=3D"gmail_=
quote">On Thu, Mar 3, 2016 at 10:28 AM, Ufuk Celebi <span dir=3D"ltr">&lt;<=
a href=3D"mailto:uce@apache.org" target=3D"_blank">uce@apache.org</a>&gt;</=
span> wrote:<br><blockquote class=3D"gmail_quote" style=3D"margin:0 0 0 .8e=
x;border-left:1px #ccc solid;padding-left:1ex">I had an offline chat with T=
ill about this. He pointed out that the<br>
address is chosen once at start up time (while not being able to<br>
connect to the old job manager) and then it stays fixed at eth1.<br>
<br>
You can increase the lookup timeout by setting akka.lookup.timeout to<br>
a higher value (like 100 s). This is the only workaround I&#39;m aware of<b=
r>
at this point. Maybe Till can chime in here whether this has other<br>
implications as well?<br>
<span><font color=3D"#888888"><br>
=E2=80=93 Ufuk<br>
</font></span><div><div><br>
On Thu, Mar 3, 2016 at 9:59 AM, Ufuk Celebi &lt;<a href=3D"mailto:uce@apach=
e.org" target=3D"_blank">uce@apache.org</a>&gt; wrote:<br>
&gt; Hey Max!<br>
&gt;<br>
&gt; for the first WARN in<br>
&gt; org.apache.flink.runtime.webmonitor.JobManagerRetriever: this is<br>
&gt; expected if the new leader has not updated ZooKeeper yet. The<br>
&gt; important thing is that the new leading job manager is eventually<br>
&gt; retrieved. This did happen, right?<br>
&gt;<br>
&gt; Regarding eth1 vs. eth0: After the new job manager becomes leader, the=
<br>
&gt; task manager should re-try connecting to it with the same strategy as<=
br>
&gt; in the initial connection establishment (e.g. try SLOW first and only<=
br>
&gt; fall back to HEURISTIC). Can you see in the logs whether this happens?=
<br>
&gt;<br>
&gt; The best thing would be to share the complete logs. Is this possible?<=
br>
&gt; If not publicly, feel free to send it to me privately (uce at apache<b=
r>
&gt; org).<br>
&gt;<br>
&gt; =E2=80=93 Ufuk<br>
&gt;<br>
&gt;<br>
&gt; On Thu, Mar 3, 2016 at 9:21 AM, Maximilian Bode<br>
&gt; &lt;<a href=3D"mailto:maximilian.bode@tngtech.com" target=3D"_blank">m=
aximilian.bode@tngtech.com</a>&gt; wrote:<br>
&gt;&gt; Hi everyone,<br>
&gt;&gt;<br>
&gt;&gt; we are trying to get to work JobManager HA in the context of a per=
-job YARN<br>
&gt;&gt; session using the 1.0.0-rc3 from a few days ago and are having a p=
roblem<br>
&gt;&gt; concerning task managers with several network interfaces.<br>
&gt;&gt;<br>
&gt;&gt; After manually killing the job manager process, the jobmanager.log=
 on the<br>
&gt;&gt; newly allocated second job manager reads:<br>
&gt;&gt; ---<br>
&gt;&gt; 2016-03-02 18:01:09,635 WARN=C2=A0 Remoting<br>
&gt;&gt; - Tried to associate with unreachable remote address<br>
&gt;&gt; [akka.tcp://<a href=3D"http://flink@10.127.68.136:34811/" rel=3D"n=
oreferrer" target=3D"_blank">flink@10.127.68.136:34811</a>]. Address is now=
 gated for 5000 ms,<br>
&gt;&gt; all messages to this address will be delivered to dead letters. Re=
ason:<br>
&gt;&gt; Connection refused: /<a href=3D"http://10.127.68.136:34811/" rel=
=3D"noreferrer" target=3D"_blank">10.127.68.136:34811</a><br>
&gt;&gt; 2016-03-02 18:01:09,644 WARN<br>
&gt;&gt; org.apache.flink.runtime.webmonitor.JobManagerRetriever=C2=A0 =C2=
=A0 =C2=A0 =C2=A0- Failed to<br>
&gt;&gt; retrieve leader gateway and port.<br>
&gt;&gt; akka.actor.ActorNotFound: Actor not found for:<br>
&gt;&gt; ActorSelection[Anchor(akka.tcp://<a href=3D"http://flink@10.127.68=
.136:34811/" rel=3D"noreferrer" target=3D"_blank">flink@10.127.68.136:34811=
/</a>),<br>
&gt;&gt; Path(/user/jobmanager)]<br>
&gt;&gt; at<br>
&gt;&gt; akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelecti=
on.scala:65)<br>
&gt;&gt; at<br>
&gt;&gt; akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelecti=
on.scala:63)<br>
&gt;&gt; at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)<br=
>
&gt;&gt; at<br>
&gt;&gt; akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1=
(BatchingExecutor.scala:67)<br>
&gt;&gt; at<br>
&gt;&gt; akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(B=
atchingExecutor.scala:82)<br>
&gt;&gt; at<br>
&gt;&gt; akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(Batching=
Executor.scala:59)<br>
&gt;&gt; at<br>
&gt;&gt; akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(Batching=
Executor.scala:59)<br>
&gt;&gt; at scala.concurrent.BlockContext$.withBlockContext(BlockContext.sc=
ala:72)<br>
&gt;&gt; at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala=
:58)<br>
&gt;&gt; at<br>
&gt;&gt; akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatc=
hedExecute(Future.scala:74)<br>
&gt;&gt; at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.s=
cala:110)<br>
&gt;&gt; at<br>
&gt;&gt; akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execut=
e(Future.scala:73)<br>
&gt;&gt; at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise=
.scala:40)<br>
&gt;&gt; at<br>
&gt;&gt; scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.s=
cala:248)<br>
&gt;&gt; at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)<br>
&gt;&gt; at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)=
<br>
&gt;&gt; at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)=
<br>
&gt;&gt; at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)<br>
&gt;&gt; at<br>
&gt;&gt; akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(=
RemoteActorRefProvider.scala:87)<br>
&gt;&gt; at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)<br>
&gt;&gt; at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)<br>
&gt;&gt; at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)<br=
>
&gt;&gt; at<br>
&gt;&gt; akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHan=
dling$$finishTerminate(FaultHandling.scala:210)<br>
&gt;&gt; at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.=
scala:172)<br>
&gt;&gt; at akka.actor.ActorCell.terminate(ActorCell.scala:369)<br>
&gt;&gt; at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)<br>
&gt;&gt; at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)<br>
&gt;&gt; at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:27=
9)<br>
&gt;&gt; at akka.dispatch.Mailbox.run(Mailbox.scala:220)<br>
&gt;&gt; at akka.dispatch.Mailbox.exec(Mailbox.scala:231)<br>
&gt;&gt; at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java=
:260)<br>
&gt;&gt; at<br>
&gt;&gt; scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinP=
ool.java:1339)<br>
&gt;&gt; at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.j=
ava:1979)<br>
&gt;&gt; at<br>
&gt;&gt; scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerT=
hread.java:107)<br>
&gt;&gt; ---<br>
&gt;&gt; where the IP not found is from the old job manager. So far, is thi=
s the<br>
&gt;&gt; expected behavior?<br>
&gt;&gt;<br>
&gt;&gt; The problem then arises on a new task manager, which also tries to=
 connect<br>
&gt;&gt; to the old job manager unsuccessfully. The ZooKeeperLeaderRetrieva=
lService<br>
&gt;&gt; starts cycling through the available network interfaces, as can be=
 seen in<br>
&gt;&gt; the relevant taskmanager.log:<br>
&gt;&gt; ---<br>
&gt;&gt; 2016-03-02 18:01:13,636 INFO<br>
&gt;&gt; org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalS=
ervice=C2=A0 -<br>
&gt;&gt; Starting ZooKeeperLeaderRetrievalService.<br>
&gt;&gt; 2016-03-02 18:01:13,646 INFO<br>
&gt;&gt; org.apache.flink.runtime.util.LeaderRetrievalUtils=C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 - Trying to<br>
&gt;&gt; select the network interface and address to use by connecting to t=
he leading<br>
&gt;&gt; JobManager.<br>
&gt;&gt; 2016-03-02 18:01:13,646 INFO<br>
&gt;&gt; org.apache.flink.runtime.util.LeaderRetrievalUtils=C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 - TaskManager<br>
&gt;&gt; will try to connect for 10000 milliseconds before falling back to =
heuristics<br>
&gt;&gt; 2016-03-02 18:01:13,712 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Retrieved new target address /<a href=3D"http://10.127.68.136:34=
811/" rel=3D"noreferrer" target=3D"_blank">10.127.68.136:34811</a>.<br>
&gt;&gt; 2016-03-02 18:01:14,079 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Trying to connect to address /<a href=3D"http://10.127.68.136:34=
811/" rel=3D"noreferrer" target=3D"_blank">10.127.68.136:34811</a><br>
&gt;&gt; 2016-03-02 18:01:14,082 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address<br>
&gt;&gt; &#39;<a href=3D"http://task.manager.eth0.hostname.com/10.127.68.13=
6" rel=3D"noreferrer" target=3D"_blank">task.manager.eth0.hostname.com/10.1=
27.68.136</a>&#39;: Connection refused<br>
&gt;&gt; 2016-03-02 18:01:14,082 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address &#39;/<a href=3D"http://10.127.68=
.136/" rel=3D"noreferrer" target=3D"_blank">10.127.68.136</a>&#39;: Connect=
ion refused<br>
&gt;&gt; 2016-03-02 18:01:14,082 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address &#39;/<a href=3D"http://10.120.19=
3.110/" rel=3D"noreferrer" target=3D"_blank">10.120.193.110</a>&#39;: Conne=
ction refused<br>
&gt;&gt; 2016-03-02 18:01:14,082 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address &#39;/<a href=3D"http://10.127.68=
.136/" rel=3D"noreferrer" target=3D"_blank">10.127.68.136</a>&#39;: Connect=
ion refused<br>
&gt;&gt; 2016-03-02 18:01:14,083 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address &#39;/<a href=3D"http://127.0.0.1=
/" rel=3D"noreferrer" target=3D"_blank">127.0.0.1</a>&#39;: Connection refu=
sed<br>
&gt;&gt; 2016-03-02 18:01:14,083 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address &#39;/<a href=3D"http://10.120.19=
3.110/" rel=3D"noreferrer" target=3D"_blank">10.120.193.110</a>&#39;: Conne=
ction refused<br>
&gt;&gt; 2016-03-02 18:01:14,083 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address &#39;/<a href=3D"http://10.127.68=
.136/" rel=3D"noreferrer" target=3D"_blank">10.127.68.136</a>&#39;: Connect=
ion refused<br>
&gt;&gt; 2016-03-02 18:01:14,083 INFO=C2=A0 org.apache.flink.runtime.net.Co=
nnectionUtils<br>
&gt;&gt; - Failed to connect from address &#39;/<a href=3D"http://127.0.0.1=
/" rel=3D"noreferrer" target=3D"_blank">127.0.0.1</a>&#39;: Connection refu=
sed<br>
&gt;&gt; ---<br>
&gt;&gt; After five repetitions, the task manager stops trying to retrieve =
the leader<br>
&gt;&gt; and, using the HEURISTIC strategy, ends up using eth1 (10.120.=
193.110) from<br>
&gt;&gt; now on:<br>
&gt;&gt; ---<br>
&gt;&gt; 2016-03-02 18:01:23,650 INFO<br>
&gt;&gt; org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalS=
ervice=C2=A0 -<br>
&gt;&gt; Stopping ZooKeeperLeaderRetrievalService.<br>
&gt;&gt; 2016-03-02 18:01:23,655 INFO=C2=A0 org.apache.zookeeper.ClientCnxn=
<br>
&gt;&gt; - EventThread shut down<br>
&gt;&gt; 2016-03-02 18:01:23,655 INFO=C2=A0 org.apache.zookeeper.ZooKeeper<=
br>
&gt;&gt; - Session: 0x25229757cff035b closed<br>
&gt;&gt; 2016-03-02 18:01:23,664 INFO<br>
&gt;&gt; org.apache.flink.runtime.taskmanager.TaskManager=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 - TaskManager<br>
&gt;&gt; will use hostname/address &#39;<a href=3D"http://task.manager.eth1=
.hostname.com/" rel=3D"noreferrer" target=3D"_blank">task.manager.eth1.host=
name.com</a>&#39; (10.120.193.110)<br>
&gt;&gt; for communication.<br>
&gt;&gt; ---<br>
&gt;&gt; Subsequently, the new jobmanager is discovered and the taskmanager=
 is able to<br>
&gt;&gt; register at the jobmanager using eth1. The problem is that connect=
ions TO<br>
&gt;&gt; eth1 are not possible, so Flink should always use eth0. The except=
ion we<br>
&gt;&gt; later see is:<br>
&gt;&gt; ---<br>
&gt;&gt; java.io.IOException: Connecting the channel failed: Connecting to =
remote<br>
&gt;&gt; task manager + &#39;other.task.manager.eth1.hostname/<a href=3D"ht=
tp://10.120.193.111:46620/" rel=3D"noreferrer" target=3D"_blank">10.120.193=
.111:46620</a>&#39; has<br>
&gt;&gt; failed. This might indicate that the remote task manager has been =
lost.<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.runtime.io.network.netty.PartitionRequestClientFa=
ctory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:1=
96)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.runtime.io.network.netty.PartitionRequestClientFa=
ctory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131)<=
br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.runtime.io.network.netty.PartitionRequestClientFa=
ctory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)<b=
r>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.runtime.io.network.netty.NettyConnectionManager.c=
reatePartitionRequestClient(NettyConnectionManager.java:60)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.runtime.io.network.partition.consumer.RemoteInput=
Channel.requestSubpartition(RemoteInputChannel.java:115)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.runtime.io.network.partition.consumer.SingleInput=
Gate.requestPartitions(SingleInputGate.java:388)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.runtime.io.network.partition.consumer.SingleInput=
Gate.getNextBufferOrEvent(SingleInputGate.java:411)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBloc=
ked(BarrierBuffer.java:108)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.streaming.runtime.io.StreamInputProcessor.process=
Input(StreamInputProcessor.java:175)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(On=
eInputStreamTask.java:65)<br>
&gt;&gt; at<br>
&gt;&gt; org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamT=
ask.java:224)<br>
&gt;&gt; at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)<br=
>
&gt;&gt; at java.lang.Thread.run(Thread.java:744)<br>
&gt;&gt; ---<br>
&gt;&gt; The root cause seems to be that network interface selection is sti=
ll using<br>
&gt;&gt; the old jobmanager location and hence is not able to choose the ri=
ght<br>
&gt;&gt; interface. In particular, it seems that iteration order over the n=
etwork<br>
&gt;&gt; interfaces differs between the HEURISTIC and SLOW strategy, which =
then leads<br>
&gt;&gt; to the wrong interface being selected.<br>
&gt;&gt;<br>
&gt;&gt; Cheers,<br>
&gt;&gt;=C2=A0 Max<br>
&gt;&gt; =E2=80=94<br>
&gt;&gt; Maximilian Bode * Junior Consultant * <a href=3D"mailto:maximilian=
.bode@tngtech.com" target=3D"_blank">maximilian.bode@tngtech.com</a><br>
&gt;&gt; TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterf=C3=B6hr=
ing<br>
&gt;&gt; Gesch=C3=A4ftsf=C3=BChrer: Henrik Klagges, Christoph Stock, Dr. Ro=
bert Dahlke<br>
&gt;&gt; Sitz: Unterf=C3=B6hring * Amtsgericht M=C3=BCnchen * HRB 135082<br=
>
&gt;&gt;<br>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></blockquote></div><br></div></div></div></blockquote></div><br></div=
>

--001a114a91b2218462052d2507db--