flink-user mailing list archives

From Ravinder Kaur <neetu0...@gmail.com>
Subject Re: TaskManager unable to register with JobManager
Date Wed, 03 Feb 2016 17:34:45 GMT

Thanks for the quick reply. I set jobmanager.rpc.address in flink-conf.yaml
to the hostname of the master node on both nodes.

Now the TaskManager does not start on the worker node at all. When I
start the cluster using ./bin/start-cluster.sh on the master, it shows the
normal output for starting the JobManager and TaskManager, but when I run jps
on the nodes, the worker does not have the TaskManager running.

Running the WordCount example again fails with the same error. Stopping
the cluster reports that there is no TaskManager to stop.
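
For reference, the SYN_SENT state and the "Connection timed out" messages in the original report quoted below suggest that the worker cannot open a TCP connection to the JobManager's RPC port. A minimal sketch of a check that could be run on the worker is shown here. The helper name can_connect is hypothetical, and the host "master" and port 6123 are assumed from the akka.tcp address in the quoted log; this is not part of Flink.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration; it only tests raw TCP reachability
    and says nothing about whether a JobManager is actually answering.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refused connections, and timeouts.
        return False
```

Calling can_connect("master", 6123) on the worker would return False if the host name does not resolve there or if the port is unreachable, for example because of a firewall between the two VMs. It would return True once the connection can be established.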

Kind Regards,
Ravinder Kaur

On Wed, Feb 3, 2016 at 5:47 PM, Stephan Ewen <sewen@apache.org> wrote:

> Looks like the network configuration is not correct.
> I would try setting the full host name (like "master.abc.xyz.com") as
> jobmanager.rpc.address.
> Greetings,
> Stephan
> On Wed, Feb 3, 2016 at 5:43 PM, Ravinder Kaur <neetu0404@gmail.com> wrote:
>> Hello Community,
>> I'm a student and new to Apache Flink. I'm trying to learn it and have set up
>> a 2-node standalone Flink (0.10.1) cluster (one master and one worker). I'm
>> facing the following issue.
>> Cluster: consists of 2 VMs (one master and one worker)
>> The configurations are done as per
>> https://ci.apache.org/projects/flink/flink-docs-release-0.10/setup/cluster_setup.html
>> When I start the cluster both the JobManager and the TaskManager are
>> started on the master and worker respectively.
>> Command to start the cluster: bin/start-cluster.sh
>> jps shows all the processes running.
>> Then I run the following command to run a WordCount example job:
>> ./bin/flink run ./examples/WordCount.jar
>> The result is attached to this mail.
>> The error is
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Not enough free slots available to run the job
>> ....................... Resources available to scheduler: Number of
>> instances=0, total number of slots= 0, available slots=0
>> I therefore suppose that the JobManager does not find the TaskManager. I
>> checked the TaskManager logs, which indeed show that the TaskManager is
>> unable to register at the JobManager for quite a long time. The log
>> contains these messages:
>>   org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>   localhost: Connect timed out
>>   org.apache.flink.runtime.net.ConnectionUtils: Failed to connect from
>>   address localhost: Network is Unreachable
>> After a number of attempts the TaskManager starts up and tries to register
>> at the JobManager, which also fails after many attempts with the following
>> messages:
>>   org.apache.flink.runtime.taskmanager.TaskManager: Trying to register at
>>   JobManager akka.tcp://flink@master:6123/user/jobmanager (attempt:92,
>>   timeout:30seconds)
>>   org.apache.flink.runtime.taskmanager.TaskManager: Tried to associate with
>>   unreachable remote host [akka.tcp://flink@master:6123/user/jobmanager].
>>   Address is now gated for 5000ms, all messages to this address will be
>>   delivered to dead letters. Reason: Connection timed out: /master:6123
>> I searched the internet for these messages and found the following links
>> helpful:
>> http://stackoverflow.com/questions/33601020/flink-job-wont-run-with-higher-taskmanager-heap-mb
>> https://issues.apache.org/jira/browse/FLINK-1119
>> Stephan Ewen, who provided the solution in both links, explains that
>> TaskManagers can take quite some time to register at the JobManager, so I
>> waited as long as 20 minutes after starting the cluster before running the
>> job. Even after waiting that long I get the same error.
>> Another suggestion was to run the cluster in streaming mode. I tried that
>> with the command bin/start-cluster-streaming.sh and ran the job, but I
>> get the same error.
>> I re-checked all the configurations but could not find anything wrong.
>> I also checked port 6123 on the master, which is in LISTEN state, while the
>> TCP request from the worker to the master stays in SYN_SENT state (checked
>> using the netstat -na and lsof -i commands).
>> I opened the web page on the master at http://localhost:8081 but it shows
>> nothing, and localhost:8080 says connection refused.
>> Kindly help me out as it is very important for me. Let me know if you
>> have any questions.
>> Kind Regards,
>> Ravinder Kaur
