flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Scott Kidder (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-7021) Flink Task Manager hangs on startup if one Zookeeper node is unresolvable
Date Wed, 28 Jun 2017 14:11:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Scott Kidder updated FLINK-7021:
--------------------------------
    Affects Version/s: 1.3.1

> Flink Task Manager hangs on startup if one Zookeeper node is unresolvable
> -------------------------------------------------------------------------
>
>                 Key: FLINK-7021
>                 URL: https://issues.apache.org/jira/browse/FLINK-7021
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.2.0, 1.3.0, 1.2.1, 1.3.1
>         Environment: Kubernetes cluster running:
> * Flink 1.3.0 Job Manager & Task Manager on Java 8u131
> * Zookeeper 3.4.10 cluster with 3 nodes
>            Reporter: Scott Kidder
>
> h2. Problem
> Flink Task Manager will hang during startup if one of the Zookeeper nodes in the Zookeeper
connection string is unresolvable.
> h2. Expected Behavior
> Flink should retry name resolution & connection to Zookeeper nodes with exponential
back-off.
> h2. Environment Details
> We're running Flink and Zookeeper in Kubernetes on CoreOS. CoreOS can run in a configuration
that automatically detects and applies operating system updates. We have a Zookeeper node
running on the same CoreOS instance as Flink. It's possible that the Zookeeper node will not
yet be started when the Flink components are started. This could cause hostname resolution
of the Zookeeper nodes to fail.
> h3. Flink Task Manager Logs
> {noformat}
> 2017-06-27 15:38:51,713 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - Using configured hostname/address for TaskManager: 10.2.45.11
> 2017-06-27 15:38:51,714 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - Starting TaskManager
> 2017-06-27 15:38:51,714 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - Starting TaskManager actor system at 10.2.45.11:6122.
> 2017-06-27 15:38:52,950 INFO  akka.event.slf4j.Slf4jLogger                          
       - Slf4jLogger started
> 2017-06-27 15:38:53,079 INFO  Remoting                                              
       - Starting remoting
> 2017-06-27 15:38:53,573 INFO  Remoting                                              
       - Remoting started; listening on addresses :[akka.tcp://flink@10.2.45.11:6122]
> 2017-06-27 15:38:53,576 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - Starting TaskManager actor
> 2017-06-27 15:38:53,660 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig 
       - NettyConfig [server address: /10.2.45.11, server port: 6121, ssl enabled: false,
memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual),
number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client
connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
> 2017-06-27 15:38:53,682 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration
 - Messages have a max timeout of 10000 ms
> 2017-06-27 15:38:53,688 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerServices
    - Temporary file directory '/tmp': total 49 GB, usable 42 GB (85.71% usable)
> 2017-06-27 15:38:54,071 INFO  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool
 - Allocated 96 MB for network buffer pool (number of memory segments: 3095, bytes per segment:
32768).
> 2017-06-27 15:38:54,564 INFO  org.apache.flink.runtime.io.network.NetworkEnvironment
       - Starting the network environment and its components.
> 2017-06-27 15:38:54,576 INFO  org.apache.flink.runtime.io.network.netty.NettyClient 
       - Successful initialization (took 4 ms).
> 2017-06-27 15:38:54,677 INFO  org.apache.flink.runtime.io.network.netty.NettyServer 
       - Successful initialization (took 101 ms). Listening on SocketAddress /10.2.45.11:6121.
> 2017-06-27 15:38:54,981 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerServices
    - Limiting managed memory to 0.7 of the currently free heap space (612 MB), memory will
be allocated lazily.
> 2017-06-27 15:38:55,050 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager  
       - I/O manager uses directory /tmp/flink-io-ca01554d-f25e-4c17-a828-96d82b43d4a7 for
spill files.
> 2017-06-27 15:38:55,061 INFO  org.apache.flink.runtime.metrics.MetricRegistry       
       - Configuring StatsDReporter with {interval=10 SECONDS, port=8125, host=localhost,
class=org.apache.flink.metrics.statsd.StatsDReporter}.
> 2017-06-27 15:38:55,065 INFO  org.apache.flink.metrics.statsd.StatsDReporter        
       - Configured StatsDReporter with {host:localhost, port:8125}
> 2017-06-27 15:38:55,065 INFO  org.apache.flink.runtime.metrics.MetricRegistry       
       - Periodically reporting metrics in intervals of 10 SECONDS for reporter statsd of
type org.apache.flink.metrics.statsd.StatsDReporter.
> 2017-06-27 15:38:55,175 INFO  org.apache.flink.runtime.filecache.FileCache          
       - User file cache uses directory /tmp/flink-dist-cache-e4c5bcc5-7513-40d9-a665-0d33c80a36ba
> 2017-06-27 15:38:55,187 INFO  org.apache.flink.runtime.filecache.FileCache          
       - User file cache uses directory /tmp/flink-dist-cache-310ba2f8-f96a-4c3f-b1db-35ac26b83f7e
> 2017-06-27 15:38:55,273 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - Starting TaskManager actor at akka://flink/user/taskmanager#207081801.
> 2017-06-27 15:38:55,273 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - TaskManager data connection information: 7f86855dac2af4cca9eb2ae4c046630e @ flink-taskmanager-3116622558-sqggc
(dataPort=6121)
> 2017-06-27 15:38:55,273 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - TaskManager has 2 task slot(s).
> 2017-06-27 15:38:55,276 INFO  org.apache.flink.runtime.taskmanager.TaskManager      
       - Memory usage stats: [HEAP: 124/981/981 MB, NON HEAP: 43/44/-1 MB (used/committed/max)]
> 2017-06-27 15:38:55,276 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
 - Starting ZooKeeperLeaderRetrievalService.
> 2017-06-27 15:39:10,289 ERROR org.apache.flink.shaded.org.apache.curator.ConnectionState
   - Connection timed out for connection string (zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181)
and timeout (15000) / elapsed (18617)
> org.apache.flink.shaded.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode
= ConnectionLoss
> 	at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
> 	at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
> 	at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
> 	at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
> 	at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
> 	at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
> 	at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100)
> 	at org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205)
> 	at akka.actor.Actor$class.aroundPreStart(Actor.scala:472)
> 	at org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120)
> 	at akka.actor.ActorCell.create(ActorCell.scala:580)
> 	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> 	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> 	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2017-06-27 15:39:30,349 INFO  org.apache.zookeeper.ZooKeeper                        
       - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af
> 2017-06-27 15:40:00,388 WARN  org.apache.flink.shaded.org.apache.curator.ConnectionState
   - Connection attempt unsuccessful after 68719 (greater than max timeout of 60000). Resetting
connection and trying again with a new connection.
> 2017-06-27 15:40:00,388 INFO  org.apache.zookeeper.ZooKeeper                        
       - Initiating client connection, connectString=zookeeper-0.zookeeper:2181,zookeeper-1.zookeeper:2181,zookeeper-2.zookeeper:2181
sessionTimeout=60000 watcher=org.apache.flink.shaded.org.apache.curator.ConnectionState@16f7b4af
> 2017-06-27 15:40:00,450 ERROR org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl
 - Ensure path threw exception
> java.net.UnknownHostException: zookeeper-1.zookeeper: Name or service not known
> 	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
> 	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
> 	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
> 	at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> 	at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> 	at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
> 	at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
> 	at org.apache.flink.shaded.org.apache.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:29)
> 	at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:150)
> 	at org.apache.flink.shaded.org.apache.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94)
> 	at org.apache.flink.shaded.org.apache.curator.HandleHolder.internalClose(HandleHolder.java:128)
> 	at org.apache.flink.shaded.org.apache.curator.HandleHolder.closeAndReset(HandleHolder.java:77)
> 	at org.apache.flink.shaded.org.apache.curator.ConnectionState.reset(ConnectionState.java:261)
> 	at org.apache.flink.shaded.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:221)
> 	at org.apache.flink.shaded.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
> 	at org.apache.flink.shaded.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
> 	at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
> 	at org.apache.flink.shaded.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
> 	at org.apache.flink.shaded.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
> 	at org.apache.flink.shaded.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594)
> 	at org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158)
> 	at org.apache.flink.shaded.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
> 	at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
> 	at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
> 	at org.apache.flink.shaded.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
> 	at org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService.start(ZooKeeperLeaderRetrievalService.java:100)
> 	at org.apache.flink.runtime.taskmanager.TaskManager.preStart(TaskManager.scala:205)
> 	at akka.actor.Actor$class.aroundPreStart(Actor.scala:472)
> 	at org.apache.flink.runtime.taskmanager.TaskManager.aroundPreStart(TaskManager.scala:120)
> 	at akka.actor.ActorCell.create(ActorCell.scala:580)
> 	at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
> 	at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
> 	at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
> 	at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message