Nice to hear :-)

Best,
tison.


Aleksandar Mastilovic <amastilovic@sightmachine.com> 于2019年8月23日周五 上午2:22写道:
Thanks for all the help, people - you made me go through my code once again and discover that I switched argument positions for job manager and resource manager addresses :-)

The docker ensemble now starts fine, I’m working on ironing out the bugs now.

I’ll participate in the survey too!

On Aug 21, 2019, at 7:18 PM, Zili Chen <wander4096@gmail.com> wrote:

Besides, would you like to participant our survey thread[1] on
user list about "How do you use high-availability services in Flink?"

It would help Flink improve its high-availability serving.

Best,
tison.


Zili Chen <wander4096@gmail.com> 于2019年8月22日周四 上午10:16写道:
Hi Aleksandar,

base on your log:

taskmanager_1   | 2019-08-22 00:05:03,713 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000).
taskmanager_1   | 2019-08-22 00:05:04,137 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..

it looks like you return a jobmanager address on retrieval service of resource manager. Please check the implementation carefully or share it on mailing list that others can help for investigation.

Best,
tison.


Zhu Zhu <reedpor@gmail.com> 于2019年8月22日周四 上午10:11写道:
Hi Aleksandar,

The resource manager address is retrieved from the HA services.
Would you check whether your customized HA services is returning the right  LeaderRetrievalService and whether the LeaderRetrievalService is really retrieving the right leader's address?
Or is it possible that the stored resource manager address in HA is replaced by jobmanager address in any case?

Thanks,
Zhu Zhu

Aleksandar Mastilovic <amastilovic@sightmachine.com> 于2019年8月22日周四 上午8:16写道:
Hi all,

I’m experimenting with using my own implementation of HA services instead of ZooKeeper that would persist JobManager information on a Kubernetes volume instead of in ZooKeeper.

I’ve set the high-availability option in flink-conf.yaml to the FQN of my factory class, and started the docker ensemble as I usually do (i.e. with no special “cluster” arguments or scripts.)

What’s happening now is that TaskManager is unable to connect to ResourceManager, because it seems it’s trying to use the /user/jobmanager path instead of /user/resourcemanager.

Here’s what I found in the logs:


jobmanager_1    | 2019-08-22 00:05:00,963 INFO  akka.remote.Remoting                                          - Remoting started; listening on addresses :[akka.tcp://flink@jobmanager:6123]
jobmanager_1    | 2019-08-22 00:05:00,975 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils         - Actor system started at akka.tcp://flink@jobmanager:6123

jobmanager_1    | 2019-08-22 00:05:02,380 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .

jobmanager_1    | 2019-08-22 00:05:03,138 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .

jobmanager_1    | 2019-08-22 00:05:03,211 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - ResourceManager akka.tcp://flink@jobmanager:6123/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000

jobmanager_1    | 2019-08-22 00:05:03,292 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher akka.tcp://flink@jobmanager:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000

taskmanager_1   | 2019-08-22 00:05:03,713 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000).
taskmanager_1   | 2019-08-22 00:05:04,137 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..

Is this a known bug? I’d appreciate any help I can get.

Thanks,
Aleksandar Mastilovic