flink-dev mailing list archives

From Deepak Jha <dkjhan...@gmail.com>
Subject Re: Flink HA on AWS: Network related issue
Date Fri, 09 Sep 2016 15:01:50 GMT
Hi Till,
I'm getting the following messages in the JobManager log:

2016-09-09 07:46:55,093 PDT [WARN]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] akka.remote.RemoteWatcher - *Detected
unreachable: [akka.tcp://flink@10.8.4.57:6121]*
2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985]
o.a.f.runtime.jobmanager.JobManager - Task manager
akka.tcp://flink@10.8.4.57:6121/user/taskmanager terminated.
2016-09-09 07:46:55,094 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] o.a.f.r.instance.InstanceManager
- Unregistered task manager akka.tcp://flink@10.8.4.57:6121/user/taskmanager.
Number of registered task managers 2. Number of available slots 4.
2016-09-09 07:46:55,096 PDT [WARN]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-982] Remoting - Association to
[akka.tcp://flink@10.8.4.57:6121] having UID [-1223410403] is irrecoverably
failed. *UID is now quarantined and all messages to this UID will be
delivered to dead letters. Remote actorsystem must be restarted to recover
from this situation.*
2016-09-09 07:46:55,097 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-982] akka.actor.LocalActorRef -
Message [akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://flink/deadLetters] to
Actor[akka://flink/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0/endpointWriter/endpointReader-akka.tcp%3A%2F%2Fflink%4010.8.4.57%3A6121-0#393939009]
was not delivered. [54] dead letters encountered. This logging can be
turned off or adjusted with configuration settings 'akka.log-dead-letters'
and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:55,098 PDT [INFO]  ip-10-8-11-249
[flink-akka.actor.default-dispatcher-985] akka.actor.LocalActorRef -
Message [akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://flink/deadLetters] to
Actor[akka://flink/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2Fflink%4010.8.4.57%3A51291-2#1151730456]
was not delivered. [55] dead letters encountered. This logging can be
turned off or adjusted with configuration settings 'akka.log-dead-letters'
and 'akka.log-dead-letters-during-shutdown'.
2016-09-09 07:46:58,479 PDT [INFO]  ip-10-8-11-249
[ForkJoinPool-3-worker-1] o.a.f.r.c.ZooKeeperCompletedCheckpointStore -
Recovering checkpoints from ZooKeeper.

Hope this helps. I'm using Flink 1.0.2.
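
For anyone who wants to check a longer log for the same symptoms, here is a minimal sketch that pulls out the "Detected unreachable" and "quarantined" events. The two patterns are taken from the excerpt above; the function name `scan_log` and the shape of the returned dictionary are my own choices for illustration, not part of Flink or Akka.

```python
import re

# Patterns modelled on the log excerpt above. They are assumptions based on
# that excerpt only; other Akka versions may word these messages differently.
UNREACHABLE = re.compile(r"Detected unreachable: \[(akka\.tcp://[^\]\s]+)")
QUARANTINED = re.compile(
    r"Association to \[(akka\.tcp://[^\]\s]+)\] having UID \[(-?\d+)\] "
    r"is irrecoverably failed"
)


def scan_log(text):
    """Return unreachable addresses and quarantined (address, UID) pairs.

    scan_log is a hypothetical helper name. Whitespace is collapsed first
    because mail clients and log viewers often wrap long log lines.
    """
    flat = " ".join(text.split())
    return {
        "unreachable": UNREACHABLE.findall(flat),
        "quarantined": QUARANTINED.findall(flat),
    }
```

Calling `scan_log(open(path).read())` on a JobManager or TaskManager log file (where `path` is whichever log you want to inspect) returns both lists, so it is easy to see which remote address was quarantined and under which UID.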

On Fri, Sep 9, 2016 at 12:34 AM, Till Rohrmann <trohrmann@apache.org> wrote:

> Hi Deepak,
>
> could you check in the logs whether the JobManager has been quarantined and
> thus cannot be connected to anymore? The logs should at least contain a
> hint as to why the TaskManager lost the connection initially.
>
> Cheers,
> Till
>
> On Thu, Sep 8, 2016 at 7:08 PM, Deepak Jha <dkjhanitt@gmail.com> wrote:
>
> > Hi,
> > I've set up Flink HA on AWS (3 TaskManagers and 2 JobManagers, each on an
> > EC2 m4.large instance, with checkpointing enabled on S3). My topology works
> > fine, but after a few hours the TaskManagers get detached from the
> > JobManager. I tried to reach the JobManager using telnet at the same time,
> > and it worked, but the TaskManager does not succeed in connecting again. It
> > reattaches only after I restart it. I tried the following settings, but the
> > problem persists.
> >
> > akka.ask.timeout: 20 s
> > akka.lookup.timeout: 20 s
> > akka.watch.heartbeat.interval: 20 s
> >
> > Please find attached a snapshot from one of the TaskManagers. Is there any
> > setting that I need to change?
> >
> > --
> > Thanks,
> > Deepak Jha
> >
> >
>



-- 
Thanks,
Deepak Jha
