flink-user mailing list archives

From Gary Yao <g...@ververica.com>
Subject Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager
Date Fri, 15 Mar 2019 07:08:45 GMT
Hi Harshith,

In the jobmanager.sh script, the 2nd argument is assigned to the HOST variable
[1]. How are you invoking jobmanager.sh? Prior to 1.5, the script expected an
execution mode (local or cluster), but this is no longer the case [2].
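
To illustrate the argument handling, here is a minimal sketch. It is not the
actual jobmanager.sh: the function name build_args is hypothetical, and it only
models how a second positional argument would be turned into a --host option.
It assumes bash.

```shell
#!/usr/bin/env bash
# Simplified, hypothetical stand-in for the argument handling in jobmanager.sh.
# It only mirrors the mapping of the 2nd positional argument to --host.
build_args() {
    local startstop="$1"   # e.g. "start" (unused in this sketch)
    local host="$2"        # 2nd positional argument is treated as the host
    local args=()
    if [ -n "$host" ]; then
        args+=("--host" "$host")
    fi
    echo "${args[*]}"
}

# Pre-1.5 style invocation: "cluster" was meant as an execution mode,
# but in this sketch it ends up as the host name.
build_args start cluster

# Invocation without a 2nd argument: no --host option is added.
build_args start
```

If the script is still invoked the pre-1.5 way, for example with "start
cluster", the word "cluster" would be passed on as the host, which would be
consistent with the "cluster" hostname seen in the logs quoted below.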

Best,
Gary

[1]
https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh
[2]
https://github.com/apache/flink/commit/d61664ca64bcb82c4e8ddf03a2ed38fe8edafa98

On Fri, Mar 15, 2019 at 3:36 AM Kumar Bolar, Harshith <hkumv@arity.com>
wrote:

> Hi Gary,
>
>
>
> An update. I noticed the line “--host cluster” in the program arguments
> section of the job manager logs. So, I commented out the following section in
> jobmanager.sh, and the task manager is now able to connect to the job manager
> without issues.
>
>
>
>     if [ ! -z $HOST ]; then
>         args+=("--host")
>         args+=("${HOST}")
>     fi
>
>
>
>
>
> Task manager logs after commenting those lines:
>
>
>
>
> 2019-03-14 22:31:02,863 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at
> akka://flink/user/taskmanager_0 .
> 2019-03-14 22:31:02,875 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 22:31:02,876 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job
> leader service.
> 2019-03-14 22:31:02,877 INFO
> org.apache.flink.runtime.filecache.FileCache                  - User file
> cache uses directory
> /tmp/flink-dist-cache-12d5905f-d694-46f6-9359-3a636188b008
> 2019-03-14 22:31:02,884 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting
> to ResourceManager
> akka.tcp://flink@flink0-1.flink1.us-east-1.high.ue1.non.aws.cloud.arity.com:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213).
> 2019-03-14 22:31:03,109 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved
> ResourceManager address, beginning registration
> 2019-03-14 22:31:03,110 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            -
> Registration at ResourceManager attempt 1 (timeout=100ms)
> 2019-03-14 22:31:03,228 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            -
> Registration at ResourceManager attempt 2 (timeout=200ms)
> 2019-03-14 22:31:03,266 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Successful
> registration at resource manager
> akka.tcp://flink@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager
> under registration id 170ee6a00f80ee02ead0e88710093d77.
>
>
>
>
>
> Thanks,
>
> Harshith
>
>
>
> *From: *Harshith Kumar Bolar <hkumv@arity.com>
> *Date: *Friday, 15 March 2019 at 7:38 AM
> *To: *Gary Yao <gary@ververica.com>
> *Cc: *user <user@flink.apache.org>
> *Subject: *Re: [External] Re: Re: Flink 1.7.2: Task Manager not able to
> connect to Job Manager
>
>
>
> Hi Gary,
>
>
>
> Here are the full job manager and task manager logs. In the job manager
> logs, I see it says “starting StandaloneSessionClusterEntrypoint”, whereas
> in Flink 1.4.2, it used to say “starting JobManager”. Is this correct?
>
>
>
> Job manager logs: https://paste.ubuntu.com/p/DCVzsQdpHq/
>
> Task Manager logs: https://paste.ubuntu.com/p/wbvYFZxdT8/
>
>
>
> Thanks,
>
> Harshith
>
>
>
> *From: *Gary Yao <gary@ververica.com>
> *Date: *Thursday, 14 March 2019 at 10:11 PM
> *To: *Harshith Kumar Bolar <hkumv@arity.com>
> *Cc: *user <user@flink.apache.org>
> *Subject: *[External] Re: Re: Flink 1.7.2: Task Manager not able to
> connect to Job Manager
>
>
>
> Hi Harshith,
>
> The truncated log is not enough. Can you share the complete logs? If that's
> not possible, I'd like to see the beginning of the log files where the
> cluster configuration is logged.
>
> The TaskManager tries to connect to the leader that is advertised in
> ZooKeeper. In your case the "cluster" hostname is advertised, which hints at
> a problem in your Flink configuration.
>
> Best,
> Gary
>
>
>
> On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hkumv@arity.com>
> wrote:
>
> Hi Gary,
>
>
>
> I’ve attached the relevant portions of the JM and TM logs.
>
>
>
> *Job Manager Logs:*
>
> 2019-03-14 11:38:28,257 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
> - State change: CONNECTED
> 2019-03-14 11:38:28,309 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of main cluster component log file:
> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
> 2019-03-14 11:38:28,309 INFO
> org.apache.flink.runtime.webmonitor.WebMonitorUtils           - Determined
> location of main cluster component stdout file:
> /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
> 2019-03-14 11:38:28,527 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Rest
> endpoint listening at cluster:8080
> 2019-03-14 11:38:28,527 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
> 2019-03-14 11:38:28,574 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    - Web
> frontend listening at http://cluster:8080.
> 2019-03-14 11:38:28,613 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at
> akka://flink/user/resourcemanager .
> 2019-03-14 11:38:28,674 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> at akka://flink/user/dispatcher .
> 2019-03-14 11:38:28,691 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
> 2019-03-14 11:38:28,694 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:38:28,698 INFO
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Starting ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
> 2019-03-14 11:38:28,700 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2019-03-14 11:38:28,818 WARN
> akka.remote.ReliableDeliverySupervisor                        - Association
> with remote system [akka.tcp://flink@cluster:22671] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
> 2019-03-14 11:39:09,010 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint    -
> http://cluster:8080 was granted leadership with
> leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
> 2019-03-14 11:39:09,010 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
> ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was
> granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
> 2019-03-14 11:39:09,011 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
> Starting the SlotManager.
> 2019-03-14 11:39:09,012 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Dispatcher
> akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership
> with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
> 2019-03-14 11:39:09,017 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Recovering
> all persisted jobs.
>
> *Task Manager Logs:*
>
> 2019-03-14 11:42:35,790 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
> uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill
> files.
> 2019-03-14 11:42:35,820 INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages
> have a max timeout of 10000 ms
> 2019-03-14 11:42:35,839 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting
> RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at
> akka://flink/user/taskmanager_0 .
> 2019-03-14 11:42:35,853 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:42:35,854 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job
> leader service.
> 2019-03-14 11:42:35,855 INFO
> org.apache.flink.runtime.filecache.FileCache                  - User file
> cache uses directory
> /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
> 2019-03-14 11:42:35,871 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting
> to ResourceManager
> akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
> 2019-03-14 11:42:35,963 WARN
> akka.remote.ReliableDeliverySupervisor                        - Association
> with remote system [akka.tcp://flink@cluster:31794] has failed, address
> is now gated for [50] ms. Reason: [Association failed with
> [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service
> not known]
> 2019-03-14 11:42:35,964 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not
> resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager,
> retrying in 10000 ms: Could not connect to rpc endpoint under address
> akka.tcp://flink@cluster:31794/user/resourcemanager..
> 2019-03-14 11:47:35,895 ERROR
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Fatal error
> occurred in TaskExecutor
> akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this
> instance. Terminating now.
>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,897 ERROR
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner       - Fatal error
> occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this
> instance. Terminating now.
>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>    at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>    at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>    at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>    at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>    at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>    at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopping
> TaskExecutor
> akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  -
> Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:47:35,904 INFO
> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager  -
> Shutting down TaskExecutorLocalStateStoresManager.
> 2019-03-14 11:47:35,908 INFO
> org.apache.flink.runtime.io.disk.iomanager.IOManager          - I/O manager
> removed spill file directory
> /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
> 2019-03-14 11:47:35,908 INFO
> org.apache.flink.runtime.io.network.NetworkEnvironment        - Shutting
> down the network environment and its components.
> 2019-03-14 11:47:35,914 INFO
> org.apache.flink.runtime.io.network.netty.NettyClient         - Successful
> shutdown (took 5 ms).
> 2019-03-14 11:47:35,917 INFO
> org.apache.flink.runtime.io.network.netty.NettyServer         - Successful
> shutdown (took 2 ms).
> 2019-03-14 11:47:35,925 INFO
> org.apache.flink.runtime.taskexecutor.JobLeaderService        - Stop job
> leader service.
> 2019-03-14 11:47:35,931 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Stopped
> TaskExecutor
> akka.tcp://flink@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,931 INFO
> org.apache.flink.runtime.blob.PermanentBlobCache              - Shutting
> down BLOB cache
> 2019-03-14 11:47:35,933 INFO
> org.apache.flink.runtime.blob.TransientBlobCache              - Shutting
> down BLOB cache
> 2019-03-14 11:47:35,943 INFO
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl
> - backgroundOperationsLoop exiting
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  -
> Session: 0x26977a24c4e0018 closed
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  -
> EventThread shut down for session: 0x26977a24c4e0018
> 2019-03-14 11:47:35,950 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopping
> Akka RPC service.
> 2019-03-14 11:47:35,952 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
> down remote daemon.
> 2019-03-14 11:47:35,952 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,959 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Shutting
> down remote daemon.
> 2019-03-14 11:47:35,966 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remote
> daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,983 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
> shut down.
> 2019-03-14 11:47:35,984 INFO
> akka.remote.RemoteActorRefProvider$RemotingTerminator         - Remoting
> shut down.
> 2019-03-14 11:47:35,992 INFO
> org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Stopped
> Akka RPC service.
>
>
>
>
>
> *From: *Gary Yao <gary@ververica.com>
> *Date: *Thursday, 14 March 2019 at 9:06 PM
> *To: *Harshith Kumar Bolar <hkumv@arity.com>
> *Cc: *user <user@flink.apache.org>
> *Subject: *[External] Re: Flink 1.7.2: Task Manager not able to connect
> to Job Manager
>
>
>
> Hi Harshith,
>
>
>
> Can you share JM and TM logs?
>
>
>
> Best,
>
> Gary
>
>
>
> On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hkumv@arity.com>
> wrote:
>
> Hi all,
>
>
>
> I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2.
>
>
>
> When I bring up the cluster, the task managers refuse to connect to the
> job managers with the following error.
>
>
>
>         2019-03-14 10:34:41,551 WARN akka.remote.ReliableDeliverySupervisor
>         - Association with remote system [akka.tcp://flink@cluster:22671]
>         has failed, address is now gated for [50] ms. Reason: [Association
>         failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster:
>         Name or service not known]
>
>
>
> Now, this works correctly if I add the following line to the /etc/hosts
> file.
>
>
>
>         x.x.x.x job-manager-address.com cluster
>
>
>
> Why is Flink 1.7.2 connecting to the JM using "cluster" in the address? Flink
> 1.4.2 used to have the job manager's address instead of the word "cluster".
>
>
>
> Thanks,
>
> Harshith
>
>
>
>
