flink-user mailing list archives

From Jason Kania <jason.ka...@ymail.com>
Subject Leader Retrieval Timeout with HA Job Manager
Date Tue, 15 May 2018 03:15:52 GMT
Hi,

I am using the Flink 1.4.2 release on Ubuntu and trying to run an HA JobManager, but
enabling the HA functionality prevents job submission, which fails with the following error:

java.lang.RuntimeException: Failed to retrieve JobManager address
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:308)
        at org.apache.flink.client.program.StandaloneClusterClient.getClusterIdentifier(StandaloneClusterClient.java:86)
        at org.apache.flink.client.CliFrontend.createClient(CliFrontend.java:921)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:264)
        at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader address and leader session ID.
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:113)
        at org.apache.flink.client.program.ClusterClient.getJobManagerAddress(ClusterClient.java:302)
        ... 8 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [60000 milliseconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:190)
        at scala.concurrent.Await.result(package.scala)
        at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderConnectionInfo(LeaderRetrievalUtils.java:111)
        ... 9 more

This also seems to be tied to problems with TaskManager registration. I have to restart the
TaskManager repeatedly until it finally connects to the JobManager. Most of the time it does
not connect and logs no complaint, which makes the root cause harder to determine. The
cluster is not busy, and I have tried both IP addresses and host names to rule out name
resolution issues, but the behavior is the same in both cases.
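
For anyone wanting to rule out name resolution or basic connectivity independently of Flink,
a check along the following lines may help. This is only a minimal sketch: the class name
ReachabilityCheck, the method canConnect, and the default host and port are placeholders
chosen for illustration, not values from my setup or anything provided by Flink.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReachabilityCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    public static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            // The InetSocketAddress constructor attempts to resolve the host name.
            // If resolution fails, connect() throws UnknownHostException,
            // which is an IOException and is reported here as "not reachable".
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder defaults; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;

        // Show what the name resolves to on this node (throws if it does not resolve).
        System.out.println(host + " resolves to " + InetAddress.getByName(host).getHostAddress());
        System.out.println("TCP connect to " + host + ":" + port + " succeeded: "
                + canConnect(host, port, 5000));
    }
}
```

Running it from each node against the other nodes, once with the host name and once with the
IP address, would show whether the two forms behave differently at the network level.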

I have also noticed that if two JobManagers are launched on different nodes in the same
cluster, both log that they are the leader. So they are not coordinating effectively, and
the logs do not indicate that they are even attempting to contact one another.

Lastly, the error "Could not retrieve the leader address and leader session ID." is a very
poor error message because it does not say where it is trying to retrieve the information from.

Any suggestions would be appreciated.

Thanks,

Jason
