flink-user mailing list archives

From Ravinder Kaur <neetu0...@gmail.com>
Subject Re: TaskManager unable to register with JobManager
Date Wed, 03 Feb 2016 20:23:22 GMT
Hello,

Thank you for pointing it out. I had made a small typo when editing the
hostname in flink-conf.yaml. I have fixed it and the TaskManager now starts up.
However, I still can't run the WordCount example; it throws the same
NoResourceAvailableException.

Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job. You can decrease the operator parallelism or increase the number of slots per TaskManager in the configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at getDefaultTextLineDataSet(WordCountData.java:70) (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72) (1/1)) @ (unassigned) - [SCHEDULED] > with groupID < 31e497f2f68c9cee5864c8fddaff3d59 > in sharing group < SlotSharingGroup [f9ed1aab933e061a8ce1ecaa3534f18c, 037bb78a1902f7edea69a978ad7b54ce, 31e497f2f68c9cee5864c8fddaff3d59] >. Resources available to scheduler: Number of instances=0, total number of slots=0, available slots=0
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleTask(Scheduler.java:256)
        at org.apache.flink.runtime.jobmanager.scheduler.Scheduler.scheduleImmediately(Scheduler.java:131)
        at org.apache.flink.runtime.executiongraph.Execution.scheduleForExecution(Execution.java:298)
        at org.apache.flink.runtime.executiongraph.ExecutionVertex.scheduleForExecution(ExecutionVertex.java:458)
        at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.scheduleAll(ExecutionJobVertex.java:322)
        at org.apache.flink.runtime.executiongraph.ExecutionGraph.scheduleForExecution(ExecutionGraph.java:679)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply$mcV$sp(JobManager.scala:982)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$1.apply(JobManager.scala:962)
        ... 8 more
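
The telling part of this exception is its last sentence: "Number of instances=0" means that no TaskManager has registered with the JobManager, so changing the parallelism or the slot count would not help on its own. As a rough sketch (the helper name `scheduler_resources` is hypothetical and not part of Flink), the counts can be pulled out of a pasted exception message like this:

```python
import re

def scheduler_resources(message):
    """Extract instance and slot counts from a NoResourceAvailableException message.

    Returns a dict of ints, or None if the summary sentence is not present.
    """
    # Collapse line wraps and repeated spaces introduced by terminals or mail clients.
    text = " ".join(message.split())
    pattern = (
        r"Number of instances=\s*(\d+), total number of slots=\s*(\d+), "
        r"available slots=\s*(\d+)"
    )
    match = re.search(pattern, text)
    if match is None:
        return None
    instances, total, available = (int(group) for group in match.groups())
    return {
        "instances": instances,
        "total_slots": total,
        "available_slots": available,
    }
```

If `instances` is 0, the thing to investigate is TaskManager registration (network and hostname settings) rather than the slot configuration.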

The TaskManager log again shows the same errors as before.

20:58:58,457 INFO  org.apache.flink.runtime.net.ConnectionUtils
     - Failed to connect from address '/slave-IP': connect timed out
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
     - Failed to connect from address '/0:0:0:0:0:0:0:1%1': Network is
unreachable
20:58:58,458 INFO  org.apache.flink.runtime.net.ConnectionUtils
     - Failed to connect from address '/127.0.0.1': Invalid argument
20:58:59,048 WARN  org.apache.flink.runtime.net.ConnectionUtils
     - Could not connect to /master-IP:6123. Selecting a local address
using heuristics.
20:58:59,050 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager will use hostname/address 'hostname-of-slave' (slave-IP)
for communication.
20:58:59,051 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager in streaming mode BATCH_ONLY
20:58:59,052 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor system at slave_IP:0
20:58:59,776 INFO  akka.event.slf4j.Slf4jLogger
     - Slf4jLogger started
20:58:59,842 INFO  Remoting
     - Starting remoting
20:59:00,094 INFO  Remoting
     - Remoting started; listening on addresses :[akka.tcp://flink@slave-IP
:33813]
20:59:00,100 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor
20:59:00,125 INFO  org.apache.flink.runtime.io.network.netty.NettyConfig
      - NettyConfig [server address: hostname-of-master/master-IP, server
port: 49030, memory segment size (bytes): 32768, transport type: NIO,
number of server threads: 0 (use Netty's default), number of client
threads: 0 (use Netty's default), server connect backlog: 0 (use Netty's
default), client connect timeout (sec): 120, send/receive buffer size
(bytes): 0 (use Netty's default)]
20:59:00,131 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Messages between TaskManager and JobManager have a max timeout of
100000 milliseconds
20:59:00,142 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Temporary file directory '/tmp': total 4 GB, usable 1 GB (25.00%
usable)
20:59:00,210 INFO
 org.apache.flink.runtime.io.network.buffer.NetworkBufferPool  - Allocated
64 MB for network buffer pool (number of memory segments: 2048, bytes per
segment: 32768).
20:59:00,323 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Using 0.7 of the currently free heap space for Flink managed heap
memory (293 MB).
20:59:00,565 INFO  org.apache.flink.runtime.io.disk.iomanager.IOManager
     - I/O manager uses directory
/tmp/flink-io-c7796b82-6676-4604-97fd-df09001a84e8 for spill files.
20:59:00,578 INFO  org.apache.flink.runtime.filecache.FileCache
     - User file cache uses directory
/tmp/flink-dist-cache-13ed3e76-cf1e-46fa-9ba2-5177e801429e
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Starting TaskManager actor at
akka://flink/user/taskmanager#-157676733.
20:59:00,908 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager data connection information: hostname-of-master
(dataPort=49030)
20:59:00,909 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - TaskManager has 1 task slot(s).
20:59:00,910 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Memory usage stats: [HEAP: 376/491/491 MB, NON HEAP: 24/49/304 MB
(used/committed/max)]
20:59:00,917 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Trying to register at JobManager
akka.tcp://flink@master-IP:6123/user/jobmanager
(attempt 1, timeout: 500 milliseconds)
20:59:01,443 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Trying to register at JobManager
akka.tcp://flink@master-IP:6123/user/jobmanager
(attempt 2, timeout: 1000 milliseconds)
20:59:02,873 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Trying to register at JobManager
akka.tcp://flink@master-IP:6123/user/jobmanager
(attempt 3, timeout: 2000 milliseconds)
20:59:04,893 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Trying to register at JobManager
akka.tcp://flink@master-IP:6123/user/jobmanager
(attempt 4, timeout: 4000 milliseconds)
20:59:08,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager
     - Trying to register at JobManager
akka.tcp://flink@master-IP:6123/user/jobmanager
(attempt 5, timeout: 8000 milliseconds)
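
The "connect timed out" lines and the repeated registration attempts above suggest that the worker cannot open a TCP connection to the JobManager port. Below is a minimal sketch for checking this from the worker node, assuming Python is installed there. The helper names `resolve` and `can_connect` are hypothetical and not part of Flink; the host and port passed to them should be the jobmanager.rpc.address and RPC port from flink-conf.yaml (6123 in the log above).

```python
import socket

def resolve(host):
    """Return the sorted IP addresses that `host` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        # Name resolution failed (compare the UnknownHostException in this thread).
        return []
    return sorted({info[4][0] for info in infos})

def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False

# Example use on the worker (replace the placeholder with the real master hostname):
#   print(resolve("hostname-of-master"))
#   print(can_connect("hostname-of-master", 6123))
```

If `resolve` returns addresses but `can_connect` returns False while the master shows port 6123 in LISTEN state, a firewall or routing rule between the two VMs is a plausible cause, though this check alone cannot confirm it.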


Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 8:12 PM, Stephan Ewen <sewen@apache.org> wrote:

> This looks like the reason:
>
> java.net.UnknownHostException: Cannot resolve the JobManager hostname
> 'hostname-of-master' specified in the configuration
>
> On Wed, Feb 3, 2016 at 7:29 PM, Ravinder Kaur <neetu0404@gmail.com> wrote:
>
>> Hello,
>>
>> The log file of the Taskmanager now shows the following
>>
>> 18:27:10,082 WARN  org.apache.hadoop.util.NativeCodeLoader
>>         - Unable to load native-hadoop library for your platform... using
>> builtin-java classes where applicable
>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -
>> --------------------------------------------------------------------------------
>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  Starting TaskManager (Version: 0.10.1, Rev:2e9b231,
>> Date:22.11.2015 @ 12:41:12 CET)
>> 18:27:10,244 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  Current user: flink
>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.7/24.91-b01
>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  Maximum heap size: 491 MiBytes
>> 18:27:10,245 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  JAVA_HOME: /usr/lib/jvm/java-1.7.0-openjdk-amd64
>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  Hadoop version: 2.7.0
>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  JVM Options:
>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     -Xms512M
>> 18:27:10,247 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     -Xmx512M
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     -XX:MaxDirectMemorySize=8388607T
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     -XX:MaxPermSize=256m
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -
>> -Dlog.file=/home/flink/flink-0.10.1/log/flink-flink-taskmanager-0-vm-10-155-208-137.cloud.mwn.de.log
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -
>> -Dlog4j.configuration=file:/home/flink/flink-0.10.1/conf/log4j.properties
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -
>> -Dlogback.configurationFile=file:/home/flink/flink-0.10.1/conf/logback.xml
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  Program Arguments:
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     --configDir
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     /home/flink/flink-0.10.1/conf
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     --streamingMode
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -     batch
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -  Classpath:
>> /home/flink/flink-0.10.1/lib/flink-dist_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/flink-python_2.11-0.10.1.jar:/home/flink/flink-0.10.1/lib/log4j-1.2.17.jar:/home/flink/flink-0.10.1/lib/slf4j-log4j12-1.7.7.jar:/usr/lib/jvm/java-1.7.0-openjdk-amd64/lib/tools.jar::
>> 18:27:10,248 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        -
>> --------------------------------------------------------------------------------
>> 18:27:10,252 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Maximum number of open file descriptors is 4096
>> 18:27:10,277 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Loading configuration from /home/flink/flink-0.10.1/conf
>> 18:27:10,356 INFO  org.apache.flink.runtime.taskmanager.TaskManager
>>        - Security is not enabled. Starting non-authenticated TaskManager.
>> 18:27:10,365 ERROR org.apache.flink.runtime.taskmanager.TaskManager
>>        - Failed to run TaskManager.
>> java.net.UnknownHostException: Cannot resolve the JobManager hostname
>> 'hostname-of-master' specified in the configuration
>>         at
>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:79)
>>         at
>> org.apache.flink.runtime.util.StandaloneUtils.createLeaderRetrievalService(StandaloneUtils.java:48)
>>         at
>> org.apache.flink.runtime.util.LeaderRetrievalUtils.createLeaderRetrievalService(LeaderRetrievalUtils.java:69)
>>         at
>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndPort(TaskManager.scala:1351)
>>         at
>> org.apache.flink.runtime.taskmanager.TaskManager$.selectNetworkInterfaceAndRunTaskManager(TaskManager.scala:1328)
>>         at
>> org.apache.flink.runtime.taskmanager.TaskManager$.main(TaskManager.scala:1240)
>>         at
>> org.apache.flink.runtime.taskmanager.TaskManager.main(TaskManager.scala)
>>
>> Kind Regards,
>> Ravinder Kaur
>>
>> On Wed, Feb 3, 2016 at 7:19 PM, Stephan Ewen <sewen@apache.org> wrote:
>>
>>> What do the TaskManager logs say?
>>>
>>> On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <neetu0404@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Thanks for the quick reply. I tried setting jobmanager.rpc.address in
>>>> flink-conf.yaml to the hostname of the master node on both nodes.
>>>>
>>>> Now the TaskManager does not start on the worker node at all. When I
>>>> start the cluster using ./bin/start-cluster.sh on the master, it shows the
>>>> normal output of starting the JobManager and TaskManager, but when I run
>>>> jps on the nodes, the worker does not have the TaskManager running.
>>>>
>>>> Running the WordCount example again fails with the same error.
>>>> Stopping the cluster reports that there is no TaskManager to stop.
>>>>
>>>> Kind Regards,
>>>> Ravinder Kaur
>>>>
>>>> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <sewen@apache.org> wrote:
>>>>
>>>>> Looks like the network configuration is not correct.
>>>>>
>>>>> I would try setting the full host name (like "master.abc.xyz.com") as
>>>>> jobmanager.rpc.address.
>>>>>
>>>>> Greetings,
>>>>> Stephan
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <neetu0404@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hello Community,
>>>>>>
>>>>>> I'm a student and new to Apache Flink. I'm trying to learn it and have
>>>>>> set up a 2-node standalone Flink (0.10.1) cluster (one master and one
>>>>>> worker). I'm facing the following issue.
>>>>>>
>>>>>> Cluster: consists of 2 vms (one master and one worker)
>>>>>>
>>>>>> The configurations are done as per
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>>>>
>>>>>> When I start the cluster, the JobManager and the TaskManager are
>>>>>> started on the master and the worker respectively.
>>>>>>
>>>>>> Command to start the cluster : bin/start-cluster.sh
>>>>>>
>>>>>> JPS shows all the processes running.
>>>>>>
>>>>>> Then I run the following command to run a WordCount example job:
>>>>>> ./bin/flink run ./examples/WordCount.jar
>>>>>>
>>>>>> The result is attached to this mail.
>>>>>>
>>>>>> The error is
>>>>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>>>>> Not enough free slots available to run the job
>>>>>> ....................... Resources available to scheduler: Number of
>>>>>> instances=0, total number of slots=0, available slots=0
>>>>>>
>>>>>> I therefore suppose that the JobManager cannot find the TaskManager.
>>>>>> The TaskManager log indeed shows that it is unable to register at the
>>>>>> JobManager for quite a long time. The log contains these messages:
>>>>>>
>>>>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>>>>> localhost: Connect timed out
>>>>>>
>>>>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>>>>> address localhost: Network is Unreachable
>>>>>>
>>>>>> Later, after a number of attempts, the TaskManager starts up and tries
>>>>>> to register at the JobManager, which also fails after many attempts
>>>>>> with the following messages:
>>>>>>
>>>>>> org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at
>>>>>> JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92,
>>>>>> timeout:30seconds)
>>>>>>
>>>>>> org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate
>>>>>> with unreachable remote host
>>>>>> [akka.tcp://flink@master:6123/user/jobmanager]. Address is now gated
>>>>>> for 5000ms, all messages to this address will be delivered to dead
>>>>>> letters. Reason: Connection timed out: /master:6123
>>>>>>
>>>>>> I searched the internet for these messages and found the following
>>>>>> links helpful:
>>>>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>>>>> and https://issues.apache.org/jira/browse/FLINK-1119. Stephan Ewen, who
>>>>>> provided the solution in both, explains that the TaskManagers can take
>>>>>> quite some time to register at the JobManager, so I waited as long as
>>>>>> 20 minutes after starting the cluster before running the job. Even
>>>>>> after waiting that long I get the same error.
>>>>>>
>>>>>> Another suggestion was to run the cluster in streaming mode, so I
>>>>>> tried it with the command bin/start-cluster-streaming.sh and ran the
>>>>>> job, but I get the same error.
>>>>>>
>>>>>> I re-checked all the configurations but could not find anything wrong.
>>>>>> I also checked port 6123 on the master, which is in LISTEN state, while
>>>>>> the TCP request from the worker to the master stays in SYN_SENT state
>>>>>> (checked with netstat -na and lsof -i).
>>>>>>
>>>>>> I opened the webpage on master http://localhost:8081 but it shows
>>>>>> nothing and localhost:8080 says connection refused.
>>>>>>
>>>>>> Kindly help me out as it is very important for me. Let me know if you
>>>>>> have any questions.
>>>>>>
>>>>>> Kind Regards,
>>>>>> Ravinder Kaur
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
