flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: strange behavior with jobmanager.rpc.address on standalone HA cluster
Date Mon, 07 May 2018 12:38:25 GMT
Hi Derek,

1. I've created a JIRA issue to improve the docs as you recommended [1].

2. This discussion goes quite a bit into the internals of the HA setup. Let
me pull in Till (in CC) who knows the details of HA.

Best, Fabian

[1] https://issues.apache.org/jira/browse/FLINK-9309
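For context, a ZooKeeper HA setup like the one described below is typically configured along these lines; the hostnames, ports, and paths here are placeholders, not values taken from this thread:

```yaml
# flink-conf.yaml -- sketch of a ZooKeeper HA configuration (all values
# are placeholders). The same file is shared by jobmanagers and
# taskmanagers, as in the setup described below.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk0:2181,zk1:2181,zk2:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.cluster-id: /my-cluster
# port range the jobmanagers use for HA RPC (must be open in the firewall)
high-availability.jobmanager.port: 50000-50025
```

The `conf/masters` file would then list one `host:webui-port` entry per line, e.g. `jm0:8081` and `jm1:8081`.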

2018-05-05 15:34 GMT+02:00 Derek VerLee <derekverlee@gmail.com>:

> Two things:
> 1. I think it would be beneficial to drop a line somewhere in the docs
> (probably on the production-ready checklist as well as the HA page)
> explaining that enabling ZooKeeper high availability allows your jobs to
> restart automatically after a jobmanager crash or restart.  We had spent
> some cycles trying to implement job restarting and watchdogs (poorly)
> before I discovered this from a Flink Forward presentation on YouTube.
> 2. I seem to have found some odd behavior with HA and then found something
> that works, but I can't explain why.  The CliffsNotes version is that I
> took an existing standalone cluster with a single JM and reconfigured it
> for high-availability ZooKeeper mode.  The same flink-conf.yaml file is
> used on all nodes (including the JM). This seemed to work fine: I
> restarted the JM (jm0) and the jobs relaunched when it came back.  Easy!
> Then I deployed a second JM (jm1).  I modified `masters`, set the HA RPC
> port range, and opened those ports on the firewall for both jobmanagers,
> but left `jobmanager.rpc.address` at its original value, `jm0`, on all
> nodes.  I then observed that jm0 worked fine: taskmanagers connected to
> it and jobs ran.  jm1 did not redirect (301) me to jm0, however; it
> displayed an empty dashboard (no jobs, no TMs).  When I stopped jm0, the
> jobs showed up on jm1 as RESTARTING, but the taskmanagers never attached
> to jm1.  In the logs, all nodes, including jm1, had messages about trying
> to reach jm0.  From the documentation and various comments I've seen,
> `jobmanager.rpc.address` should be ignored in HA mode.  However,
> commenting it out entirely led to the jobmanagers crashing at boot, and
> setting it to `localhost` caused all the taskmanagers to log messages
> about trying to connect to the jobmanager at localhost.  What finally
> worked was to set the value on each node individually to that node's own
> hostname, even on the taskmanagers.
> Does this seem like a bug?
> Just a hunch, but is there something called an "Akka leader" that is
> distinct from the jobmanager leader, and could it somehow be defaulting
> its value to jobmanager.rpc.address?
