flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7340) Taskmanager hung after temporary DNS outage
Date Sat, 14 Oct 2017 16:28:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16204715#comment-16204715
] 

Stephan Ewen commented on FLINK-7340:
-------------------------------------

Thanks, that is a good pointer!

[~till.rohrmann] Can we take this into account in the Akka / HA address management

> Taskmanager hung after temporary DNS outage
> -------------------------------------------
>
>                 Key: FLINK-7340
>                 URL: https://issues.apache.org/jira/browse/FLINK-7340
>             Project: Flink
>          Issue Type: Bug
>          Components: Core, Distributed Coordination
>    Affects Versions: 1.3.1
>         Environment: Non-HA Flink running in Kubernetes.
>            Reporter: Joshua Griffith
>
> After a Kubernetes node failure, several TaskManagers and the DNS system were automatically
restarted. One TaskManager was unable to connect to the JobManager and continually logged
the following errors:
> {quote}
> 2017-08-01 18:58:06.707 [flink-akka.actor.default-dispatcher-823] INFO  org.apache.flink.runtime.taskmanager.TaskManager
 - Trying to register at JobManager akka.tcp://flink@jobmanager:6123/user/jobmanager (attempt
595, timeout: 30000 milliseconds)
> 2017-08-01 18:58:06.713 [flink-akka.actor.default-dispatcher-834] INFO  Remoting flink-akka.remote.default-remote-dispatcher-240
- Quarantined address [akka.tcp://flink@jobmanager:6123] is still unreachable or has not been
restarted. Keeping it quarantined.
> {quote}
> After exec'ing into the container, I was able to {{telnet jobmanager 6123}} successfully
and {{dig jobmanager}} showed the correct IP in DNS. I suspect that the TaskManager cached
a bad IP address for the JobManager when the DNS system was restarting and it used that cached
address rather than respecting the 30s TTL and getting a new one for the next request. It
may be a good idea for the TaskManager to explicitly perform a DNS lookup after JobManager
connection failures.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message