flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: TaskManager unable to register with JobManager
Date Wed, 03 Feb 2016 20:27:09 GMT
Hi,

the TaskManager is starting up, but it's not able to register at the
JobManager. Did you check the JobManager log? Do you see anything suspicious
there? Are the ports matching?
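
One rough way to check the port question from the worker node is a plain TCP connect to the JobManager's RPC address. The sketch below is only an illustration: `can_connect` is a hypothetical helper, not part of Flink, and the host and port you pass should be whatever jobmanager.rpc.address and jobmanager.rpc.port are set to in flink-conf.yaml (the logs quoted below show port 6123).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests raw TCP
    reachability and knows nothing about Flink or Akka.
    """
    try:
        # create_connection resolves the host name and performs the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

Calling `can_connect("master", 6123)` on the worker (with "master" replaced by the configured host name) and getting False would point to a name resolution, firewall, or routing problem rather than to Flink itself.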


On Wed, Feb 3, 2016 at 9:23 PM, Ravinder Kaur <neetu0404@gmail.com> wrote:

> Hello,
>
> Thank you for pointing it out. I had a little typo while I edited the
> hostname in flink-conf.yaml. I've reset it and the TaskManager started up.
> But I still can't run the WordCount example and it throws the same
> NoResourceAvailableException.
>
> Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
>         at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
>         at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
>         at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
>         at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
>         at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
>         at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
>         ... 8 more
>
> The log of TaskManager again has the same errors as before.
>
> 20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils
>        - Failed to connect from address '/slave-IP': connect timed out
> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>        - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is
> unreachable
> 20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
>        - Failed to connect from address '/127.0.0.1': Invalid argument
> 20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils
>        - Could not connect to /master-IP:6123. Selecting a local address
> using heuristics.
> 20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - TaskManager will use hostname/address 'hostname-of-slave'
> (slave-IP) for communication.
> 20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager in streaming mode BATCH_ONLY
> 20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager actor system at slave_IP:0
> 20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger
>        - Slf4jLogger started
> 20:58:59,842 INFO  Remoting
>        - Starting remoting
> 20:59:00,094 INFO  Remoting
>        - Remoting started; listening on addresses
> :[akka.tcp://flink@slave-IP:33813]
> 20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager actor
> 20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
>       - NettyConfig [server address: hostname-of-master/master-IP, server
> port: 49030, memory segment size (bytes): 32768, transport type: NIO,
> number of server threads: 0 (use Netty's default), number of client
> threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's
> default), client connect timeout (sec): 120, send/receive buffer size
> (bytes): 0 (use Netty's default)]
> 20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Messages between TaskManager and JobManager have a max timeout of
> 100000 milliseconds
> 20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00%
> usable)
> 20:59:00,210 INFO
>  org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
> 64 MB for network buffer pool (number of memory segments: 2048, bytes per
> segment: 32768).
> 20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Using 0.7 of the currently free heap space for Flink managed heap
> memory (293 MB).
> 20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
>        - I/O manager uses directory
> /tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
> 20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache
>        - User file cache uses directory
> /tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Starting TaskManager actor at
> akka://flink/user/taskmanager#-157676733.
> 20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - TaskManager data connection information: hostname-of-master
> (dataPort=49030)
> 20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - TaskManager has 1 task slot(s).
> 20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB
> (used/committed/max)]
> 20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
> (attempt 1, timeout: 500 milliseconds)
> 20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
> (attempt 2, timeout: 1000 milliseconds)
> 20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
> (attempt 3, timeout: 2000 milliseconds)
> 20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
> (attempt 4, timeout: 4000 milliseconds)
> 20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>        - Trying to register at JobManager akka.tcp://flink@master-IP:6123/user/jobmanager
> (attempt 5, timeout: 8000 milliseconds)
>
>
> Kind Regards,
> Ravinder Kaur
>
> On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <sewen@apache.org> wrote:
>
>> This looks like the reason:
>>
>> java.net.UnknownHostException: Cannot resolve the JobManager hostname
>> 'hostname-of-master' specified in the configuration
>>
>> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <neetu0404@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> The log file of the TaskManager now shows the following:
>>>
>>> 18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader
>>>         - Unable to load native-hadoop library for your platform... using
>>> builtin-java classes where applicable
>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -
>>> --------------------------------------------------------------------------------
>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231,
>>> Date:22.11.2015 @ 12:41:12 CET)
>>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  Current user: flink
>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation -
>>> 1.7/24.91-b01
>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  Maximum heap size: 491 MiBytes
>>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  Hadoop version: 2.7.0
>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  JVM Options:
>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     -Xms512M
>>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     -Xmx512M
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     -XX:MaxDirectMemorySize=8388607T
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     -XX:MaxPermSize=256m
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -
>>> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -
>>> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -
>>> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  Program Arguments:
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     --configDir
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     /home/flink/flink-0.10.1/conf
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     --streamingMode
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -     batch
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -  Classpath:
>>> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          -
>>> --------------------------------------------------------------------------------
>>> 18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Maximum number of open file descriptors is 4096
>>> 18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Loading configuration from /home/flink/flink-0.10.1/conf
>>> 18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Security is not enabled. Starting non-authenticated TaskManager.
>>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager
>>>          - Failed to run TaskManager.
>>> java.net.UnknownHostException: Cannot resolve the JobManager hostname
>>> 'hostname-of-master' specified in the configuration
>>>         at
>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>>         at
>>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>>         at
>>> org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>>         at
>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>>         at
>>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>>         at
>>> org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>>         at
>>> org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>>
>>> Kind Regards,
>>> Ravinder Kaur
>>>
>>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>
>>>> What do the TaskManager logs say?
>>>>
>>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <neetu0404@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Thanks for the quick reply. I tried setting jobmanager.rpc.address in
>>>>> flink-conf.yaml to the hostname of the master node on both nodes.
>>>>>
>>>>> Now the TaskManager does not start on the worker node at all. When
>>>>> I start the cluster using ./bin/start-cluster.sh on the master, it shows
>>>>> the normal output of starting the JobManager and TaskManager, but when I
>>>>> run jps on the nodes, the slave does not have the TaskManager running.
>>>>>
>>>>> Running the WordCount example again fails with the same error.
>>>>> Stopping the cluster says there is no TaskManager to stop.
>>>>>
>>>>> Kind Regards,
>>>>> Ravinder Kaur
>>>>>
>>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>>>
>>>>>> Looks like the network configuration is not correct.
>>>>>>
>>>>>> I would try setting the full host name (like "master.abc.xyz.com")
>>>>>> as jobmanager.rpc.address.
>>>>>>
>>>>>> Greetings,
>>>>>> Stephan
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <neetu0404@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Hello Community,
>>>>>>>
>>>>>>> I'm a student and new to Apache Flink. I'm trying to learn and have
>>>>>>> set up a 2-node standalone Flink (0.10.1) cluster (one master and one
>>>>>>> worker). I'm facing the following issue.
>>>>>>>
>>>>>>> Cluster: consists of 2 VMs (one master and one worker)
>>>>>>>
>>>>>>> The configurations are done as per
>>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>>
>>>>>>> When I start the cluster, both the JobManager and the TaskManager are
>>>>>>> started on the master and worker respectively.
>>>>>>>
>>>>>>> Command to start the cluster: bin/start-cluster.sh
>>>>>>>
>>>>>>> jps shows all the processes running.
>>>>>>>
>>>>>>> Then I run the following command to run a WordCount example job:
>>>>>>> ./bin/flink run ./examples/WordCount.jar
>>>>>>>
>>>>>>> The result is attached to this mail.
>>>>>>>
>>>>>>> The error is
>>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>>> Not enough free slots available to run the job
>>>>>>> ....................... Resources available to scheduler: Number of
>>>>>>> instances=0, total number of slots=0, available slots=0
>>>>>>>
>>>>>>> I therefore suppose that the JobManager does not find the
>>>>>>> TaskManager. I checked the logs of the TaskManager, which indeed show
>>>>>>> that the TaskManager is unable to register at the JobManager for quite
>>>>>>> a long time. The TaskManager log contains these messages:
>>>>>>>
>>>>>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>>>>>> localhost: Connect timed out
>>>>>>>
>>>>>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>>>>>> address localhost: Network is Unreachable
>>>>>>>
>>>>>>> Later, after a number of attempts, it starts up and tries to register
>>>>>>> at the JobManager, which also fails after a lot of attempts with the
>>>>>>> following messages:
>>>>>>>
>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at
>>>>>>> JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92,
>>>>>>> timeout:30seconds)
>>>>>>>
>>>>>>> org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate
>>>>>>> with unreachable remote host
>>>>>>> [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated
>>>>>>> for 5000ms, all messages to this address will be delivered to dead
>>>>>>> letters. Reason: Connection timed out: /master:6123
>>>>>>>
>>>>>>> I browsed the internet for these and found
>>>>>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>>>>>> and https://issues.apache.org/jira/browse/FLINK-1119 helpful. Stephan
>>>>>>> Ewen, who provided the solution in both links, explains that the
>>>>>>> TaskManagers can take quite some time to register at the JobManager.
>>>>>>> I therefore waited as long as 20 minutes after starting the cluster
>>>>>>> before running the job, but I still get the same error.
>>>>>>>
>>>>>>> Another suggestion was to run the cluster in streaming mode, so I
>>>>>>> started it with bin/start-cluster-streaming.sh and ran the job, but I
>>>>>>> get the same error.
>>>>>>>
>>>>>>> I re-checked all the configurations but could not find anything
>>>>>>> wrong. I also checked port 6123 on the master, which is in LISTEN
>>>>>>> state, while the TCP request from the worker to the master stays in
>>>>>>> SYN_SENT state (checked with netstat -na and lsof -i).
>>>>>>>
>>>>>>> I opened the webpage on master http://localhost:8081 but it shows
>>>>>>> nothing and localhost:8080 says connection refused.
>>>>>>>
>>>>>>> Kindly help me out as it is very important for me. Let me know if
>>>>>>> you have any questions.
>>>>>>>
>>>>>>> Kind Regards,
>>>>>>> Ravinder Kaur
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
