flink-user mailing list archives

From Aleksandar Mastilovic <amastilo...@sightmachine.com>
Subject Re: TaskManager not connecting to ResourceManager in HA mode
Date Thu, 22 Aug 2019 18:21:53 GMT
Thanks for all the help, people - you made me go through my code once again and discover that
I switched argument positions for job manager and resource manager addresses :-)
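
For readers who hit the same symptom, here is a minimal sketch of how this kind of mix-up can slip through. It is not the actual code from this thread. All class names below (`StringlyTypedHaServices`, `TypedHaServices`, and the two address wrappers) are hypothetical and are not Flink APIs; the addresses are copied from the logs quoted below. When two addresses are passed as positional `String` arguments, swapping them still compiles. Wrapping each in its own type turns the swap into a compile-time error.

```java
import java.util.Objects;

// Hypothetical illustration only: none of these classes are Flink APIs.
class SwappedAddressDemo {
    public static void main(String[] args) {
        String rm = "akka.tcp://flink@jobmanager:6123/user/resourcemanager";
        String jm = "akka.tcp://flink@jobmanager:6123/user/jobmanager";

        // Bug: the arguments are swapped, yet this compiles without complaint.
        StringlyTypedHaServices buggy = new StringlyTypedHaServices(jm, rm);
        System.out.println(buggy.resourceManagerAddress()); // ends in /user/jobmanager

        // With distinct wrapper types, the same swap would not compile.
        TypedHaServices safe = new TypedHaServices(
                new ResourceManagerAddress(rm), new JobManagerAddress(jm));
        System.out.println(safe.resourceManagerAddress()); // ends in /user/resourcemanager
    }
}

/** Wrapper type for the resource manager address (hypothetical). */
final class ResourceManagerAddress {
    final String value;
    ResourceManagerAddress(String value) { this.value = Objects.requireNonNull(value); }
}

/** Wrapper type for the job manager address (hypothetical). */
final class JobManagerAddress {
    final String value;
    JobManagerAddress(String value) { this.value = Objects.requireNonNull(value); }
}

/** Error-prone variant: two positional Strings that are easy to swap. */
final class StringlyTypedHaServices {
    private final String resourceManagerAddress;
    private final String jobManagerAddress;

    StringlyTypedHaServices(String resourceManagerAddress, String jobManagerAddress) {
        this.resourceManagerAddress = resourceManagerAddress;
        this.jobManagerAddress = jobManagerAddress;
    }

    String resourceManagerAddress() { return resourceManagerAddress; }
    String jobManagerAddress() { return jobManagerAddress; }
}

/** Safer variant: each address has its own type, so a swap is a compile error. */
final class TypedHaServices {
    private final ResourceManagerAddress resourceManager;
    private final JobManagerAddress jobManager;

    TypedHaServices(ResourceManagerAddress resourceManager, JobManagerAddress jobManager) {
        this.resourceManager = resourceManager;
        this.jobManager = jobManager;
    }

    String resourceManagerAddress() { return resourceManager.value; }
    String jobManagerAddress() { return jobManager.value; }
}
```

The first line printed ends in /user/jobmanager, which matches the address the TaskManager was trying to reach in the logs.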

The docker ensemble now starts fine, I’m working on ironing out the bugs now.

I’ll participate in the survey too!

> On Aug 21, 2019, at 7:18 PM, Zili Chen <wander4096@gmail.com> wrote:
> 
> Besides, would you like to participate in our survey thread [1] on the
> user list about "How do you use high-availability services in Flink?"
> 
> It would help Flink improve its high-availability support.
> 
> Best,
> tison.
> 
> [1] https://lists.apache.org/x/thread.html/c0cc07197e6ba30b45d7709cc9e17d8497e5e3f33de504d58dfcafad@%3Cuser.flink.apache.org%3E
> 
> Zili Chen <wander4096@gmail.com> wrote on Thu, Aug 22, 2019 at 10:16 AM:
> Hi Aleksandar,
> 
> Based on your log:
> 
> taskmanager_1   | 2019-08-22 00:05:03,713 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000).
> taskmanager_1   | 2019-08-22 00:05:04,137 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..
> 
> It looks like the retrieval service for the resource manager is returning a jobmanager address. Please check the implementation carefully, or share it on the mailing list so that others can help investigate.
> 
> Best,
> tison.
> 
> 
> Zhu Zhu <reedpor@gmail.com> wrote on Thu, Aug 22, 2019 at 10:11 AM:
> Hi Aleksandar,
> 
> The resource manager address is retrieved from the HA services.
> Would you check whether your customized HA services are returning the right LeaderRetrievalService, and whether that LeaderRetrievalService is really retrieving the right leader's address?
> Or is it possible that the resource manager address stored in HA is being replaced by the jobmanager address in some case?
> 
> Thanks,
> Zhu Zhu
> 
> Aleksandar Mastilovic <amastilovic@sightmachine.com> wrote on Thu, Aug 22, 2019 at 8:16 AM:
> Hi all,
> 
> I’m experimenting with my own implementation of HA services that would persist JobManager information on a Kubernetes volume instead of in ZooKeeper.
> 
> I’ve set the high-availability option in flink-conf.yaml to the FQN of my factory class, and started the docker ensemble as I usually do (i.e. with no special “cluster” arguments or scripts).
> 
> What’s happening now is that the TaskManager is unable to connect to the ResourceManager, because it seems to be trying to use the /user/jobmanager path instead of /user/resourcemanager.
> 
> Here’s what I found in the logs:
> 
> 
> jobmanager_1    | 2019-08-22 00:05:00,963 INFO  akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@jobmanager:6123]
> jobmanager_1    | 2019-08-22 00:05:00,975 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink@jobmanager:6123
> 
> jobmanager_1    | 2019-08-22 00:05:02,380 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
> 
> jobmanager_1    | 2019-08-22 00:05:03,138 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
> 
> jobmanager_1    | 2019-08-22 00:05:03,211 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@jobmanager:6123/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
> 
> jobmanager_1    | 2019-08-22 00:05:03,292 INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@jobmanager:6123/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
> 
> taskmanager_1   | 2019-08-22 00:05:03,713 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/jobmanager(00000000000000000000000000000000).
> taskmanager_1   | 2019-08-22 00:05:04,137 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@jobmanager:6123/user/jobmanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@jobmanager:6123/user/jobmanager..
> 
> Is this a known bug? I’d appreciate any help I can get.
> 
> Thanks,
> Aleksandar Mastilovic

