flink-user mailing list archives

From Dongwon Kim <eastcirc...@gmail.com>
Subject Two issues when deploying Flink on DC/OS
Date Tue, 09 Jan 2018 09:29:06 GMT
Hi,

I've launched JobManager and TaskManager on DC/OS successfully.
Now I have two new issues:

1) All TaskManagers are scheduled on a single node. 
- Is it intended to maximize data locality and minimize network communication cost?
- Is there an option in Flink to adjust the behavior of JobManager when it considers multiple
resource offers from different Mesos agents?
- I want to schedule TaskManager processes on different GPU servers so that each TaskManager
process can use its own GPU cards exclusively.
- Below is part of the JobManager log written while the JobManager negotiates resources
with the Mesos master:
2018-01-09 07:34:54,872 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosJobManager
 - JobManager akka.tcp://flink@dnn-g08-233:18026/user/jobmanager was granted leadership with
leader session ID Some(00000000-0000-0000-0000-000000000000).
2018-01-09 07:34:55,889 INFO  org.apache.flink.mesos.scheduler.ConnectionMonitor         
  - Connecting to Mesos...
2018-01-09 07:34:55,962 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Trying to associate with JobManager leader akka.tcp://flink@dnn-g08-233:18026/user/jobmanager
2018-01-09 07:34:55,977 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#-1481183359]
- leader session 00000000-0000-0000-0000-000000000000
2018-01-09 07:34:56,479 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Scheduling Mesos task taskmanager-00001 with (10240.0 MB, 8.0 cpus).
2018-01-09 07:34:56,481 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Scheduling Mesos task taskmanager-00002 with (10240.0 MB, 8.0 cpus).
2018-01-09 07:34:56,481 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Scheduling Mesos task taskmanager-00003 with (10240.0 MB, 8.0 cpus).
2018-01-09 07:34:56,481 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Scheduling Mesos task taskmanager-00004 with (10240.0 MB, 8.0 cpus).
2018-01-09 07:34:56,481 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Scheduling Mesos task taskmanager-00005 with (10240.0 MB, 8.0 cpus).
2018-01-09 07:34:56,483 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Now gathering offers for at least 5 task(s).
2018-01-09 07:34:56,484 INFO  org.apache.flink.mesos.scheduler.ConnectionMonitor         
  - Connected to Mesos as framework ID 59b85b42-a4a2-4632-9578-9e480585ecdc-0004.
2018-01-09 07:34:56,690 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Received offer(s) of 606170.0 MB, 234.2 cpus:
2018-01-09 07:34:56,692 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   59b85b42-a4a2-4632-9578-9e480585ecdc-O2174 from 50.1.100.233 of 111186.0 MB, 45.9 cpus
for [*]
2018-01-09 07:34:56,692 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   59b85b42-a4a2-4632-9578-9e480585ecdc-O2175 from 50.1.100.235 of 123506.0 MB, 47.3 cpus
for [*]
2018-01-09 07:34:56,692 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   59b85b42-a4a2-4632-9578-9e480585ecdc-O2176 from 50.1.100.234 of 124530.0 MB, 46.6 cpus
for [*]
2018-01-09 07:34:56,692 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   59b85b42-a4a2-4632-9578-9e480585ecdc-O2177 from 50.1.100.231 of 123474.0 MB, 47.2 cpus
for [*]
2018-01-09 07:34:56,693 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   59b85b42-a4a2-4632-9578-9e480585ecdc-O2178 from 50.1.100.232 of 123474.0 MB, 47.2 cpus
for [*]
2018-01-09 07:34:57,711 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Processing 5 task(s) against 5 new offer(s) plus outstanding offers.
2018-01-09 07:34:57,726 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Resources considered: (note: expired offers not deducted from below)
2018-01-09 07:34:57,727 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   50.1.100.234 has 124530.0 MB, 46.6 cpus
2018-01-09 07:34:57,728 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   50.1.100.235 has 123506.0 MB, 47.3 cpus
2018-01-09 07:34:57,728 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   50.1.100.232 has 123474.0 MB, 47.2 cpus
2018-01-09 07:34:57,728 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   50.1.100.233 has 111186.0 MB, 45.9 cpus
2018-01-09 07:34:57,728 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   50.1.100.231 has 123474.0 MB, 47.2 cpus
2018-01-09 07:34:58,069 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Launching Mesos task taskmanager-00005 on host 50.1.100.231.
2018-01-09 07:34:58,069 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Launched 5 task(s) on 50.1.100.231 using 1 offer(s):
2018-01-09 07:34:58,070 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Launching Mesos task taskmanager-00002 on host 50.1.100.231.
2018-01-09 07:34:58,070 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Launching Mesos task taskmanager-00003 on host 50.1.100.231.
2018-01-09 07:34:58,070 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Launching Mesos task taskmanager-00004 on host 50.1.100.231.
2018-01-09 07:34:58,070 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - Launching Mesos task taskmanager-00001 on host 50.1.100.231.
2018-01-09 07:34:58,070 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  -   59b85b42-a4a2-4632-9578-9e480585ecdc-O2177
2018-01-09 07:34:58,071 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - No longer gathering offers; all requests fulfilled.
2018-01-09 07:34:58,072 INFO  com.netflix.fenzo.TaskScheduler                            
  - Expiring all leases
2018-01-09 07:34:58,072 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2176 from 50.1.100.234 of 124530.0
MB, 46.6 cpus.
2018-01-09 07:34:58,073 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2175 from 50.1.100.235 of 123506.0
MB, 47.3 cpus.
2018-01-09 07:34:58,073 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2178 from 50.1.100.232 of 123474.0
MB, 47.2 cpus.
2018-01-09 07:34:58,074 INFO  org.apache.flink.mesos.scheduler.LaunchCoordinator         
  - Declined offer 59b85b42-a4a2-4632-9578-9e480585ecdc-O2174 from 50.1.100.233 of 111186.0
MB, 45.9 cpus.
2018-01-09 07:35:05,868 INFO  org.apache.flink.mesos.scheduler.TaskMonitor               
  - Mesos task taskmanager-00005 is running.
2018-01-09 07:35:06,103 INFO  org.apache.flink.mesos.scheduler.TaskMonitor               
  - Mesos task taskmanager-00001 is running.
2018-01-09 07:35:06,111 INFO  org.apache.flink.mesos.scheduler.TaskMonitor               
  - Mesos task taskmanager-00004 is running.
2018-01-09 07:35:06,116 INFO  org.apache.flink.mesos.scheduler.TaskMonitor               
  - Mesos task taskmanager-00002 is running.
2018-01-09 07:35:06,119 INFO  org.apache.flink.mesos.scheduler.TaskMonitor               
  - Mesos task taskmanager-00003 is running.
2018-01-09 07:35:14,377 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - TaskManager taskmanager-00003 has started.
2018-01-09 07:35:14,380 INFO  org.apache.flink.runtime.instance.InstanceManager          
  - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1027/user/taskmanager)
as b94277c8ad550eeef5364947e4330c00. Current number of registered hosts is 1. Current number
of alive task slots is 8.
2018-01-09 07:35:14,389 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - TaskManager taskmanager-00004 has started.
2018-01-09 07:35:14,389 INFO  org.apache.flink.runtime.instance.InstanceManager          
  - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1033/user/taskmanager)
as e0183a5317b331b90496049b1893c922. Current number of registered hosts is 2. Current number
of alive task slots is 16.
2018-01-09 07:35:14,462 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - TaskManager taskmanager-00001 has started.
2018-01-09 07:35:14,462 INFO  org.apache.flink.runtime.instance.InstanceManager          
  - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1029/user/taskmanager)
as 8d85b49d4118514552fcad3b98fef3e2. Current number of registered hosts is 3. Current number
of alive task slots is 24.
2018-01-09 07:35:14,465 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - TaskManager taskmanager-00005 has started.
2018-01-09 07:35:14,465 INFO  org.apache.flink.runtime.instance.InstanceManager          
  - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1031/user/taskmanager)
as b740607fb2e88bcfc275498bb54ed9fd. Current number of registered hosts is 4. Current number
of alive task slots is 32.
2018-01-09 07:35:14,560 INFO  org.apache.flink.mesos.runtime.clusterframework.MesosFlinkResourceManager
 - TaskManager taskmanager-00002 has started.
2018-01-09 07:35:14,560 INFO  org.apache.flink.runtime.instance.InstanceManager          
  - Registered TaskManager at DNN-G08-231 (akka.tcp://flink@dnn-g08-231:1025/user/taskmanager)
as 95433440f37ea1790e7ef9309f110fe4. Current number of registered hosts is 5. Current number
of alive task slots is 40.
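
To make the placement in the log above easier to see, here is a minimal sketch (plain Python, not part of Flink or Mesos) that tallies the "Launching Mesos task" lines by host. The function name `tasks_per_host` and the regular expression are illustrative assumptions; the sample lines are copied from the log excerpt.

```python
import re
from collections import Counter

# Matches the continuation lines written by MesosFlinkResourceManager,
# e.g. " - Launching Mesos task taskmanager-00005 on host 50.1.100.231."
LAUNCH_RE = re.compile(r"Launching Mesos task (\S+) on host (\S+)")


def tasks_per_host(log_lines):
    """Count how many Mesos tasks were launched on each host."""
    counts = Counter()
    for line in log_lines:
        match = LAUNCH_RE.search(line)
        if match:
            # The log line ends with a period, so strip it from the host.
            host = match.group(2).rstrip(".")
            counts[host] += 1
    return counts


sample = [
    " - Launching Mesos task taskmanager-00005 on host 50.1.100.231.",
    " - Launching Mesos task taskmanager-00002 on host 50.1.100.231.",
    " - Launching Mesos task taskmanager-00003 on host 50.1.100.231.",
    " - Launching Mesos task taskmanager-00004 on host 50.1.100.231.",
    " - Launching Mesos task taskmanager-00001 on host 50.1.100.231.",
]

for host, count in tasks_per_host(sample).items():
    print(host, count)
```

For the excerpt above this reports a single host, 50.1.100.231, with all five tasks, even though offers from five agents were received.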


2) After the TaskManagers are started, the following lines are repeated in the JobManager log
every second:
2018-01-09 07:36:51,080 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler
 - Caught exception
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447)
	at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:748)
2018-01-09 07:37:43,600 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler
 - Caught exception
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447)
	at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:748)
2018-01-09 07:38:43,619 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler
 - Caught exception
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447)
	at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:748)
2018-01-09 07:39:43,630 ERROR org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler
 - Caught exception
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.flink.shaded.netty4.io.netty.buffer.UnpooledUnsafeDirectByteBuf.setBytes(UnpooledUnsafeDirectByteBuf.java:447)
	at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
	at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:748)
- Can I ignore this exception, or is there something I should fix?
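
In case it helps with diagnosis, here is a small sketch (plain Python, independent of Flink) that computes the gaps between the ERROR timestamps quoted above. The helper name `gaps_seconds` is an illustrative assumption; the timestamps are copied from the four log entries.

```python
from datetime import datetime

# Log4j-style timestamp used in the JobManager log: "2018-01-09 07:36:51,080"
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def gaps_seconds(timestamps):
    """Return the gap in seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(t, LOG_TIME_FORMAT) for t in timestamps]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(parsed, parsed[1:])]


error_times = [
    "2018-01-09 07:36:51,080",
    "2018-01-09 07:37:43,600",
    "2018-01-09 07:38:43,619",
    "2018-01-09 07:39:43,630",
]

for gap in gaps_seconds(error_times):
    print(f"{gap:.3f} s")
```

For the four entries quoted here, the gaps come out at about 52.5 s, 60.0 s and 60.0 s.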

Best,

- Dongwon

