flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: TaskManager unable to register with JobManager
Date Wed, 03 Feb 2016 18:19:53 GMT
What do the TaskManager logs say?

On Wed, Feb 3, 2016 at 6:34 PM, Ravinder Kaur <neetu0404@gmail.com> wrote:

> Hello,
>
> Thanks for the quick reply. I tried to set jobmanager.rpc.address in
> flink-conf.yaml to the hostname of master node on both the nodes.
>
> Now the TaskManager does not start on the worker node at all. When I
> start the cluster with ./bin/start-cluster.sh on the master, it prints the
> normal output for starting the JobManager and TaskManager, but when I run jps
> on the nodes, the worker has no TaskManager process running.
>
> Running the WordCount example again fails with the same error. Stopping
> the cluster reports that there is no TaskManager to stop.
>
> Kind Regards,
> Ravinder Kaur
>
> On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <sewen@apache.org> wrote:
>
>> Looks like the network configuration is not correct.
>>
>> I would try setting the full host name (like "master.abc.xyz.com") as
>> jobmanager.rpc.address.
>>
>> Greetings,
>> Stephan
>>
>>
>> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <neetu0404@gmail.com>
>> wrote:
>>
>>>
>>> Hello Community,
>>>
>>> I'm a student and new to Apache Flink. I'm trying to learn it and have
>>> set up a 2-node standalone Flink (0.10.1) cluster (one master and one
>>> worker). I'm facing the following issue.
>>>
>>> Cluster: consists of 2 VMs (one master and one worker)
>>>
>>> The configuration was done as per
>>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>>>
>>> When I start the cluster both the JobManager and the TaskManager are
>>> started on the master and worker respectively.
>>>
>>> Command to start the cluster: bin/start-cluster.sh
>>>
>>> jps shows all the processes running.
>>>
>>> Then I run the following command to run a WordCount example job:
>>> ./bin/flink run ./examples/WordCount.jar
>>>
>>> The result is attached to this mail.
>>>
>>> The error is
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>>> Not enough free slots available to run the job
>>> ....................... Resources available to scheduler: Number of
>>> instances=0, total number of slots=0, available slots=0
>>>
>>> I therefore suppose that the JobManager does not find the TaskManager.
>>> I checked the TaskManager logs, which indeed show that the TaskManager
>>> is unable to register at the JobManager for quite a long time. The log
>>> contains these messages:
>>>
>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>> localhost: Connect timed out
>>>
>>> org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>> address localhost: Network is Unreachable
>>>
>>> After a number of attempts the TaskManager starts up and tries to
>>> register at the JobManager, which also fails after many attempts with
>>> the following messages:
>>>
>>> org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at
>>> JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92,
>>> timeout:30seconds)
>>>
>>> org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with
>>> unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager].
>>> Address is now gated for 5000ms, all messages to this address will be
>>> delivered to dead letters. Reason: Connection timed out: /master:6123
>>>
>>> I searched the internet for these messages and found the following links
>>> helpful:
>>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>>> and https://issues.apache.org/jira/browse/FLINK-1119. Stephan Ewen, who
>>> provided the solution in both, explains that the TaskManagers can take
>>> quite some time to register at the JobManager, so I waited as long as 20
>>> minutes after starting the cluster before running the job. Even after
>>> waiting that long I get the same error.
>>>
>>> Another suggestion was to run the cluster in streaming mode. So I tried
>>> it with the command bin/start-cluster-streaming.sh and ran the job,
>>> but I get the same error.
>>>
>>> I rechecked all the configurations but could not find anything wrong.
>>> I also checked port 6123 on the master, which is in LISTEN state, while the
>>> TCP request from the worker to the master shows SYN_SENT state (checked
>>> with netstat -na and lsof -i).
>>>
>>> I opened the web page http://localhost:8081 on the master, but it shows
>>> nothing, and localhost:8080 says connection refused.
>>>
>>> Kindly help me out as it is very important for me. Let me know if you
>>> have any questions.
>>>
>>> Kind Regards,
>>> Ravinder Kaur
>>>
>>>
>>
>
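
For anyone reading this thread with the same symptoms: a SYN_SENT state on the worker, as reported in the original message, means the connection request to master:6123 was sent but never answered. That usually points to a firewall or name-resolution problem between the machines, though this is an inference from the netstat output and is not confirmed anywhere in this thread. Below is a minimal sketch, in Python, of checking both things from the worker. The helper names resolve and can_connect are hypothetical and are not part of Flink. The host name "master" and port 6123 are taken from the log lines quoted above.

```python
import socket


def resolve(host):
    """Return the sorted list of IP addresses that `host` resolves to here.

    Hypothetical helper: useful for spotting a master host name that
    resolves to a loopback address such as 127.0.0.1 on some machine.
    """
    infos = socket.getaddrinfo(host, None)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port completes in time.

    Hypothetical helper: a False result covers refused connections,
    timeouts (for example, packets dropped by a firewall) and
    host names that cannot be resolved.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on the worker, resolve("master") shows which address the worker will actually try, and can_connect("master", 6123) shows whether the JobManager port is reachable from there. If the name resolves to a loopback address, or the connect call returns False while the port is in LISTEN state on the master, the problem is more likely in /etc/hosts or the firewall than in the Flink configuration.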
