flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: TaskManager unable to register with JobManager
Date Wed, 10 Feb 2016 14:52:59 GMT
Note that some of these config options are only available starting from
version 1.0-SNAPSHOT

On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske <fhueske@gmail.com> wrote:

> Hi Ravinder,
>
> please have a look at the configuration documentation:
>
> -->
> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager
>
> Best, Fabian
>
> 2016-02-10 13:55 GMT+01:00 Ravinder Kaur <neetu0404@gmail.com>:
>
>> Hello All,
>>
>> I need to know the range of ports that are being used during the
>> master/slave communication in the Flink cluster. Also is there a way I can
>> specify a range of ports, at the slaves, to restrict them to connect to
>> master only in this range?
>>
>> Kind Regards,
>> Ravinder Kaur
>>
>>
>> On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen <sewen@apache.org> wrote:
>>
>>> Can machines connect to port 6123? The firewall may block that port, put
>>> permit SSH.
>>>
>>> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur <neetu0404@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Here is the log file of Jobmanager. I did not see some thing suspicious
>>>> and as it suggests the ports are also listening.
>>>>
>>>> 20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Starting JobManager on IP-of-master:6123 with execution mode
>>>> CLUSTER and streaming mode BATCH_ONLY
>>>> 20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Security is not enabled. Starting non-authenticated JobManager.
>>>> 20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Starting JobManager
>>>> 20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Starting JobManager actor system at 10.155.208.138:6123
>>>> 20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger
>>>>          - Slf4jLogger started
>>>> 20:58:48,295 INFO  Remoting
>>>>          - Starting remoting
>>>> 20:58:48,541 INFO  Remoting
>>>>          - Remoting started; listening on addresses
>>>> :[akka.tcp://flink@IP-of-master:6123]
>>>> 20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Starting JobManger web frontend
>>>> 20:58:48,690 INFO
>>>>  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Using
>>>> directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web
>>>> interface files
>>>> 20:58:48,691 INFO
>>>>  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving
>>>> job manager log from
>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
>>>> 20:58:48,691 INFO
>>>>  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Serving
>>>> job manager stdout from
>>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
>>>> 20:58:49,044 INFO
>>>>  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Web
>>>> frontend listening at 0:0:0:0:0:0:0:0:8081
>>>> 20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Starting JobManager actor
>>>> 20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer
>>>>          - Created BLOB server storage directory
>>>> /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
>>>> 20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer
>>>>          - Started BLOB server at 0.0.0.0:43683 - max concurrent
>>>> requests: 50 - max backlog: 1000
>>>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
>>>>           - Started memory archivist akka://flink/user/archive
>>>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Starting JobManager at akka.tcp://flink@IP-of-master
>>>> :6123/user/jobmanager.
>>>> 20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager
>>>> was granted leadership with leader session ID None.
>>>> 20:58:49,082 INFO
>>>>  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor         - Starting
>>>> with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on
>>>> port 8081
>>>> 20:58:49,083 INFO
>>>>  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader
>>>> reachable under akka.tcp://flink@IP-of-master
>>>> :6123/user/jobmanager:null.
>>>> 20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016).
>>>> 20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016).
>>>> 20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
>>>> 20:59:22,859 INFO
>>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN
>>>> DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
>>>> 20:59:22,881 INFO
>>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN
>>>> DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
>>>> 20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016) changed to FAILING.
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>> Not enough free slots available to run the job. You can decrease the
>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>> Number of instances=0, total number of slots=0, available slots=0
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>         at
>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>>         at
>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>>         at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>>>>         at
>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
>>>>         at
>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>         at
>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>         at
>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>         at
>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> 20:59:22,886 INFO
>>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph        - CHAIN
>>>> Reduce (SUM(1), at main(WordCount.java:72) -> FlatMap (collect()) (1/1)
>>>> (824b6e3771304cd0f92aea4ab763a11d) switched from CREATED to CANCELED
>>>> 20:59:22,887 INFO
>>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph        - DataSink
>>>> (collect() sink) (1/1) (1bb64a2edc6f68ad716acd9f8d2d7d67) switched from
>>>> CREATED to CANCELED
>>>> 20:59:22,890 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>>          - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>>> at Wed Feb 03 20:59:22 CET 2016) changed to FAILED.
>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>> Not enough free slots available to run the job. You can decrease the
>>>> operator parallelism or increase the number of slots per TaskManager in the
>>>> configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at
>>>> getDefaultTextLineDataSet(WordCountData.java:70)
>>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>>> (1/1)) @ (unassigned) - [SCHEDULED] > with groupID <
>>>> 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup
>>>> [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce,
>>>> 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler:
>>>> Number of instances=0, total number of slots=0, available slots=0
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>>>>         at
>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>         at
>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>         at
>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>>>>         at
>>>> scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>>>>
>>>>
>>>> On Wed, Feb 3, 2016 at 9:27 PM, Robert Metzger <rmetzger@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> the TaskManager is starting up, but its not able to register at the
>>>>> job manager. Did you check the JobManager log? Do you see anything
>>>>> suspicious there? Are the ports matching?
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <neetu0404@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Thank you for pointing it out. I had a little typo while I edited
the
>>>>>> hostname in flink-conf.yaml. I've reset it and the TaskManager started
up.
>>>>>> But I still can't run the WordCount example and it throws the same
>>>>>> NoResourceAvaliableException.
>>>>>>
>>>>>> Caused by:
>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableExce
>>>>>>
>>>>>>      ption: Not enough free slots available to run the job. You can
>>>>>> decrease the oper
>>>>>>                              ator parallelism or increase the number
of
>>>>>> slots per TaskManager in the configur
>>>>>>                                                  ation. Task to schedule:
<
>>>>>> Attempt #0 (CHAIN DataSource (at getDefaultTextLineDa
>>>>>>
>>>>>>  taSet(WordCountData.java:70)
>>>>>> (org.apache.flink.api.java.io.CollectionInputFormat
>>>>>>                                                                ))
->
>>>>>> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1),
at main(Wo
>>>>>>
>>>>>>            rdCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] >
with
>>>>>> groupID < 31e497f2f6
>>>>>>                                  8c9cee5864c8fddaff3d59 > in sharing
group
>>>>>> < SlotSharingGroup [f9ed1aab933e061a8c
>>>>>>                                                    e1ecaa3534f18c,
>>>>>> 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d
>>>>>>
>>>>>>  59] >. Resources available to scheduler: Number of instances=0,
total
>>>>>> number of
>>>>>>                       slots=0, available slots=0
>>>>>>         at
>>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(
>>>>>>
>>>>>>      Scheduler.java:256)
>>>>>>         at
>>>>>> org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmed
>>>>>>
>>>>>>      iately(Scheduler.java:131)
>>>>>>         at
>>>>>> org.apache.flink.runtime.executiongraph.Execution.scheduleForExecutio
>>>>>>
>>>>>>      n(Execution.java:298)
>>>>>>         at
>>>>>> org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForEx
>>>>>>
>>>>>>      ecution(ExecutionVertex.java:458)
>>>>>>         at
>>>>>> org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAl
>>>>>>
>>>>>>      l(ExecutionJobVertex.java:322)
>>>>>>         at
>>>>>> org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExe
>>>>>>
>>>>>>      cution(ExecutionGraph.java:679)
>>>>>>         at
>>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl
>>>>>>
>>>>>>
>>>>>>  ink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982
>>>>>>
>>>>>>            )
>>>>>>         at
>>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl
>>>>>>
>>>>>>
>>>>>>  ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>>         at
>>>>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$fl
>>>>>>
>>>>>>
>>>>>>  ink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>>>>>>         ... 8 more
>>>>>>
>>>>>> The log of TaskManager again has the same errors as before.
>>>>>>
>>>>>> 20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>            - Failed to connect from address '/slave-IP': connect
timed out
>>>>>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>            - Failed to connect from address '/0:0:0:0:0:0:0:1%1':
Network
>>>>>> is unreachable
>>>>>> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>            - Failed to connect from address '/127.0.0.1': Invalid
>>>>>> argument
>>>>>> 20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils
>>>>>>            - Could not connect to /master-IP:6123. Selecting a local
>>>>>> address using heuristics.
>>>>>> 20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - TaskManager will use hostname/address 'hostname-of-slave'
>>>>>> (slave-IP) for communication.
>>>>>> 20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Starting TaskManager in streaming mode BATCH_ONLY
>>>>>> 20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Starting TaskManager actor system at slave_IP:0
>>>>>> 20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger
>>>>>>            - Slf4jLogger started
>>>>>> 20:58:59,842 INFO  Remoting
>>>>>>            - Starting remoting
>>>>>> 20:59:00,094 INFO  Remoting
>>>>>>            - Remoting started; listening on addresses
>>>>>> :[akka.tcp://flink@slave-IP:33813]
>>>>>> 20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Starting TaskManager actor
>>>>>> 20:59:00,125 INFO
>>>>>>  org.apache.flink.runtime.io.network.netty.NettyConfig         -
>>>>>> NettyConfig [server address: hostname-of-master/master-IP, server
port:
>>>>>> 49030, memory segment size (bytes): 32768, transport type: NIO, number
of
>>>>>> server threads: 0 (use Netty's default), number of client threads:
0 (use
>>>>>> Netty's default), server connect backlog: 0 (use Netty's default),
client
>>>>>> connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use
>>>>>> Netty's default)]
>>>>>> 20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Messages between TaskManager and JobManager have a max
timeout
>>>>>> of 100000 milliseconds
>>>>>> 20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Temporary file directory '/tmp': total 4 GB, usable
1 GB
>>>>>> (25.00% usable)
>>>>>> 20:59:00,210 INFO
>>>>>>  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  -
Allocated
>>>>>> 64 MB for network buffer pool (number of memory segments: 2048, bytes
per
>>>>>> segment: 32768).
>>>>>> 20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Using 0.7 of the currently free heap space for Flink
managed
>>>>>> heap memory (293 MB).
>>>>>> 20:59:00,565 INFO
>>>>>>  org.apache.flink.runtime.io.disk.iomanager.IOManager          -
I/O
>>>>>> manager uses directory /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8
>>>>>> for spill files.
>>>>>> 20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache
>>>>>>            - User file cache uses directory
>>>>>> /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
>>>>>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Starting TaskManager actor at
>>>>>> akka://flink/user/taskmanager#-157676733.
>>>>>> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - TaskManager data connection information: hostname-of-master
>>>>>> (dataPort=49030)
>>>>>> 20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - TaskManager has 1 task slot(s).
>>>>>> 20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP:
24/49/304
>>>>>> MB (used/committed/max)]
>>>>>> 20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 1, timeout: 500 milliseconds)
>>>>>> 20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 2, timeout: 1000 milliseconds)
>>>>>> 20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 3, timeout: 2000 milliseconds)
>>>>>> 20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 4, timeout: 4000 milliseconds)
>>>>>> 20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>            - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
>>>>>> (attempt 5, timeout: 8000 milliseconds)
>>>>>>
>>>>>>
>>>>>> Kind Regards,
>>>>>> Ravinder Kaur
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <sewen@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> This looks like the reason:
>>>>>>>
>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager
>>>>>>> hostname 'hostname-of-master' specified in the configuration
>>>>>>>
>>>>>>> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <neetu0404@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> The log file of the Taskmanager now shows the following
>>>>>>>>
>>>>>>>> 18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader
>>>>>>>>               - Unable to load native-hadoop library for
your platform...
>>>>>>>> using builtin-java classes where applicable
>>>>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231,
>>>>>>>> Date:22.11.2015 @ 12:41:12 CET)
>>>>>>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  Current user: flink
>>>>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation
-
>>>>>>>> 1.7/24.91-b01
>>>>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  Maximum heap size: 491 MiBytes
>>>>>>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  Hadoop version: 2.7.0
>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  JVM Options:
>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     -Xms512M
>>>>>>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     -Xmx512M
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     -XX:MaxDirectMemorySize=8388607T
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     -XX:MaxPermSize=256m
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -
>>>>>>>> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -
>>>>>>>> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -
>>>>>>>> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  Program Arguments:
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     --configDir
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     /home/flink/flink-0.10.1/conf
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     --streamingMode
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -     batch
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -  Classpath:
>>>>>>>> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>>>>>>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              -
>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>> 18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              - Maximum number of open file descriptors is
4096
>>>>>>>> 18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              - Loading configuration from /home/flink/flink-0.10.1/conf
>>>>>>>> 18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              - Security is not enabled. Starting non-authenticated
>>>>>>>> TaskManager.
>>>>>>>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager
>>>>>>>>              - Failed to run TaskManager.
>>>>>>>> java.net.UnknownHostException: Cannot resolve the JobManager
>>>>>>>> hostname 'hostname-of-master' specified in the configuration
>>>>>>>>         at
>>>>>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>>>>>>>         at
>>>>>>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>>>>>>>         at
>>>>>>>> org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>>>>>>>         at
>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>>>>>>>         at
>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>>>>>>>         at
>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>>>>>>>         at
>>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>>>>>>>
>>>>>>>> Kind Regards,
>>>>>>>> Ravinder Kaur
>>>>>>>>
>>>>>>>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <sewen@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> What do the TaskManger logs say?
>>>>>>>>>
>>>>>>>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <neetu0404@gmail.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> Thanks for the quick reply. I tried to set jobmanager.rpc.address
>>>>>>>>>> in flink-conf.yaml to the hostname of master node
on both the nodes.
>>>>>>>>>>
>>>>>>>>>> Now it does not start the Taskmanager at the worker
node at all.
>>>>>>>>>> When I start the cluster using ./bin/start-cluster.sh
on master it shows
>>>>>>>>>> the normal output of starting the Jobmanager and
Taskmanager but when I run
>>>>>>>>>> jps on the nodes the slave does not have the Taskmanager
running.
>>>>>>>>>>
>>>>>>>>>> Running the WordCount example again fails showing
the same error.
>>>>>>>>>> Stopping the cluster says no taskmanager to stop.
>>>>>>>>>>
>>>>>>>>>> Kind Regards,
>>>>>>>>>> Ravinder Kaur
>>>>>>>>>>
>>>>>>>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <sewen@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Looks like the network configuration is not correct.
>>>>>>>>>>>
>>>>>>>>>>> I would try setting the full host name (like
"master.abc.xyz.com")
>>>>>>>>>>> as jobmanager.rpc.address.
>>>>>>>>>>>
>>>>>>>>>>> Greetings,
>>>>>>>>>>> Stephan
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur
<
>>>>>>>>>>> neetu0404@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Community,
>>>>>>>>>>>>
>>>>>>>>>>>> I'm a student and new to Apache Flink. I'm
trying to learn and
>>>>>>>>>>>> have setup a 2- node standalone Flink(0.10.1)
cluster (one master and one
>>>>>>>>>>>> worker). I'm facing the following issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Cluster: consists of 2 vms (one master and
one worker)
>>>>>>>>>>>>
>>>>>>>>>>>> The configurations are done as per
>>>>>>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>>>>>>>
>>>>>>>>>>>> When I start the cluster both the JobManager
and the
>>>>>>>>>>>> TaskManager are started on the master and
worker respectively.
>>>>>>>>>>>>
>>>>>>>>>>>> Command to start the cluster : bin/start-cluster.sh
>>>>>>>>>>>>
>>>>>>>>>>>> JPS shows all the processes running.
>>>>>>>>>>>>
>>>>>>>>>>>> Then I run the following command to run a
WordCount example
>>>>>>>>>>>> job: ./bin/flink run ./examples/WordCount.jar
>>>>>>>>>>>>
>>>>>>>>>>>> the result is attached with the mail.
>>>>>>>>>>>>
>>>>>>>>>>>> The error is
>>>>>>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailabeException:
>>>>>>>>>>>> Not enough free slots available to run to
run the job
>>>>>>>>>>>> ....................... Resources available
to scheduler: Number of
>>>>>>>>>>>> instances=0, total number of slots= 0, available
slots=0
>>>>>>>>>>>>
>>>>>>>>>>>> Therefore I suppose that the JobManager does
not find the
>>>>>>>>>>>> TaskManager and checked the logs of the TaskManager
which indeed shows that
>>>>>>>>>>>> the TaskManager is unable to register at
the JobManager for quite a long
>>>>>>>>>>>> time. There are org.apache.flink.runtime.net.ConnectionUtils:
>>>>>>>>>>>> Failed to connect from localhost: Connect
timed out and org.apache.flink.runtime.net.ConnectionUtils:
>>>>>>>>>>>> Failed to connect from address localhost:
Network is
>>>>>>>>>>>> Unreachable messages in the log of the TaskManager.
Later when
>>>>>>>>>>>> it starts up after a number of attempts and
tries to register at the
>>>>>>>>>>>> JobManager, which also fails after a lot
of attempts showing the following
>>>>>>>>>>>> message org.apache.flink.runtime.taskmanager.Taskmanager:
>>>>>>>>>>>> Trying to register at JobManager akka.tcp://flink@master:6123/user'/jobmanager
>>>>>>>>>>>> (attempt:92, timeout:30seconds) and org.apache.flink.runtime.taskmanager.Taskmanager:
>>>>>>>>>>>> Tried to associate with unreachable remote
host [akka.tcp://flink@master:6123/user/jobmanager].
>>>>>>>>>>>> Address is now gated for 5000ms, all messages
to this address will be
>>>>>>>>>>>> delivered to dead letters. Reason: Connection
timed out: /master:6123
>>>>>>>>>>>>
>>>>>>>>>>>> I browsed the internet for these and found
>>>>>>>>>>>>  http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>>>>>>>>>>> <http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb>
>>>>>>>>>>>> and https://issues.apache.org/jira/browse/FLINK-1119
these
>>>>>>>>>>>> links helpful. Stephan Ewen the guy who provided
the solution in both the
>>>>>>>>>>>> links gives a good explanation that the TaskManagers
take quite some time
>>>>>>>>>>>> to register at the JobManager and therefore
I waited for as long as 20 mins
>>>>>>>>>>>> after starting the cluster to run the job.
But even after waiting so long I
>>>>>>>>>>>> get the same error.
>>>>>>>>>>>>
>>>>>>>>>>>> Another suggestion was to run the cluster
in streaming mode. So
>>>>>>>>>>>> I tried it with the command : bin/start-cluster-streaming.sh
and
>>>>>>>>>>>> ran the job but I get the same error. I have
rechecked all the
>>>>>>>>>>>> configurations but I'm unable to find out
the fault.
>>>>>>>>>>>>
>>>>>>>>>>>> I re-checked all the configurations but could
not find anything
>>>>>>>>>>>> wrong. Also checked the port 6123 on master
which is in LISTEN state and
>>>>>>>>>>>> tcp request from worker to master shows SYN_SENT
state using netstat -na
>>>>>>>>>>>> and lsof -i commands.
>>>>>>>>>>>>
>>>>>>>>>>>> I opened the webpage on master http://localhost:8081
but it
>>>>>>>>>>>> shows nothing and localhost:8080 says connection
refused.
>>>>>>>>>>>>
>>>>>>>>>>>> Kindly help me out as it is very important
for me. Let me know
>>>>>>>>>>>> if you have any questions.
>>>>>>>>>>>>
>>>>>>>>>>>> Kind Regards,
>>>>>>>>>>>> Ravinder Kaur
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message