flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gary Yao <g...@data-artisans.com>
Subject Re: Cannot submit jobs on a HA Standalone JobManager
Date Thu, 03 May 2018 12:36:28 GMT
Hi Julio,

Are you using the -m flag of "bin/flink run" by any chance? In HA mode, you
cannot manually specify the JobManager address. The client determines the
leader
through ZooKeeper. If you did not configure the ZooKeeper quorum in the
flink-conf.yaml on the machine from which you are submitting, this might
explain
the error message.

> But that didn't solve my problem. So far, the `flink run` still fails
with the same message (I'm adding the full stacktrace of the failure in the
end, just in case), but now I'm also seeing this message in the JobManager
logs:
Unfortunately, the error message in your previous email is different. If the
above does not solve your problem, can you attach the logs of the client and
JobManager?

Lastly, what Flink version are you running?

Best,
Gary

On Wed, May 2, 2018 at 6:51 PM, Julio Biason <julio.biason@azion.com> wrote:

> Hey guys and gals,
>
> So, after a bit more digging, I found out that once HA is enabled,
> `jobmanager.rpc.port` is also ignore (along with `jobmanager.rpc.address`,
> but I was expecting this). Because I set the `high-availability.jobmanager.port`
> to `50010-50015`, my RPC port also changed (the docs made me think this
> would only affect the HA communication, not ALL communications). This can
> be checked on the Dashboard, under the JobManager configuration option.
>
> But that didn't solve my problem. So far, the `flink run` still fails with
> the same message (I'm adding the full stacktrace of the failure in the end,
> just in case), but now I'm also seeing this message in the JobManager logs:
>
> 2018-05-02 16:44:32,373 WARN  org.apache.flink.runtime.
> jobmanager.JobManager                - Discard message
> LeaderSessionMessage(00000000-0000-0000-0000-000000000000,SubmitJob(JobGraph(jobId:
> 42a25752ab085117a21c02d3db54777e),DETACHED)) because the expected leader
> session ID c01eba4f-44e2-4c65-85d5-a9a05ceba28e did not equal the
> received leader session ID 00000000-0000-0000-0000-
> 000000000000.
>
>
> So, I'm still lost on where to go forward.
>
>
> Failure when using `flink run`:
>
> org.apache.flink.client.program.ProgramInvocationException: The program
> execution failed: JobManager did not respond within 60000
> ms
>
>         at org.apache.flink.client.program.ClusterClient.
> runDetached(ClusterClient.java:524)
>         at org.apache.flink.client.program.StandaloneClusterClient.
> submitJob(StandaloneClusterClient.java:103)
>         at org.apache.flink.client.program.ClusterClient.run(
> ClusterClient.java:456)
>         at org.apache.flink.client.program.DetachedEnvironment.
> finalizeExecute(DetachedEnvironment.java:77)
>         at org.apache.flink.client.program.ClusterClient.run(
> ClusterClient.java:402)
>         at org.apache.flink.client.CliFrontend.executeProgram(
> CliFrontend.java:802)
>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>         at org.apache.flink.client.CliFrontend.parseParameters(
> CliFrontend.java:1054)
>         at org.apache.flink.client.CliFrontend$1.call(
> CliFrontend.java:1101)
>         at org.apache.flink.client.CliFrontend$1.call(
> CliFrontend.java:1098)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1698)
>         at org.apache.flink.runtime.security.HadoopSecurityContext.
> runSecured(HadoopSecurityContext.java:41)
>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
> Caused by: org.apache.flink.runtime.client.JobTimeoutException:
> JobManager did not respond within 60000 ms
>         at org.apache.flink.runtime.client.JobClient.
> submitJobDetached(JobClient.java:437)
>         at org.apache.flink.client.program.ClusterClient.
> runDetached(ClusterClient.java:516)
>         ... 14 more
> Caused by: java.util.concurrent.TimeoutException
>         at java.util.concurrent.CompletableFuture.timedGet(
> CompletableFuture.java:1771)
>         at java.util.concurrent.CompletableFuture.get(
> CompletableFuture.java:1915)
>         at org.apache.flink.runtime.client.JobClient.
> submitJobDetached(JobClient.java:435)
>         ... 15 more
>
>
> On Wed, May 2, 2018 at 9:52 AM, Julio Biason <julio.biason@azion.com>
> wrote:
>
>> Hello all,
>>
>> I'm building a standalone cluster with HA JobManager. So far, everything
>> seems to work, but when i try to `flink run` my job, it fails with the
>> following error:
>>
>> Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
>> Could not retrieve the leader gateway.
>>
>> So far, I have two different machines running the JobManager and, looking
>> at the logs, I can't see any problem whatsoever to explain why the flink
>> command is refusing to run the pipeline...
>>
>> Any ideas where I should look?
>>
>> --
>> *Julio Biason*, Sofware Engineer
>> *AZION*  |  Deliver. Accelerate. Protect.
>> Office: +55 51 3083 8101 <callto:+555130838101>  |  Mobile: +55 51
>> <callto:+5551996209291>*99907 0554*
>>
>
>
>
> --
> *Julio Biason*, Sofware Engineer
> *AZION*  |  Deliver. Accelerate. Protect.
> Office: +55 51 3083 8101 <callto:+555130838101>  |  Mobile: +55 51
> <callto:+5551996209291>*99907 0554*
>

Mime
View raw message