flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: Submit Flink Jobs to YARN running on AWS
Date Tue, 26 Apr 2016 10:16:59 GMT
I've started my own EMR cluster and tried to launch a Flink job from my
local machine on it.
I have to admit that configuring the EMR-launched Hadoop for external
access is quite a hassle.

I'm not even able to submit Flink to the YARN cluster because the client
cannot connect to the ResourceManager. I've changed the ResourceManager
hostname to the public one in the yarn-site.xml on the cluster and
restarted it, but the client still cannot connect.
It seems that the RM address is being overwritten by the Hadoop code?
[image: Inline image 1]

How did you manage to get this working?

In the VM settings, I disabled the "Source/Dest checks", but I don't think
this is related.

Have you considered using Amazon's VPN service? I guess you would then have
"local" access to the cluster.

On YARN, Flink is not using the flink-conf.yaml setting for the
JobManager's hostname. It's using YARN's "yarn.nodemanager.hostname" from
the yarn-site.xml.
I haven't tried it, but it could work if you set the public hostname of
each NodeManager in the yarn-site.xml.
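
An untested sketch of what that entry might look like in the yarn-site.xml
on each NodeManager host. The property name is the one mentioned above; the
hostname value is only a placeholder and would have to be replaced with that
node's own public DNS name:

```xml
<!-- yarn-site.xml on each NodeManager host (untested sketch) -->
<property>
  <name>yarn.nodemanager.hostname</name>
  <!-- Placeholder value: use this node's own public hostname -->
  <value>ec2-XX-XX-XX-XX.compute-1.amazonaws.com</value>
</property>
```

The NodeManagers would presumably need a restart to pick up the change, and
whether Flink then advertises the public address is an assumption that still
has to be verified.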

Also, maybe the product forum / customer support of Amazon can help you
here. Other systems like Spark or Storm have very similar architectures and
will face the same issues. I guess they have some recipes for such
situations.

Regards,
Robert




On Tue, Apr 26, 2016 at 10:47 AM, Robert Metzger <rmetzger@apache.org>
wrote:

> Hi Abhi,
>
> I'll try to reproduce the issue and come up with a solution.
>
> On Tue, Apr 26, 2016 at 1:13 AM, Bajaj, Abhinav <abhinav.bajaj@here.com>
> wrote:
>
>> Hi Fabian,
>>
>> Thanks for your reply and the pointers to documentation.
>>
>> In these steps, I think the Flink client is installed on the master node,
>> as in the steps described in the Flink docs here
>> <https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html>.
>> However, the scenario I have is to run the client on my local machine and
>> submit jobs remotely to the YARN Cluster (running on EMR or independently).
>>
>> Let me describe in more detail here.
>> I am trying to submit a single Flink Job to YARN using the client,
>> running on my dev machine -
>>
>> ./bin/flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096 ./examples/batch/WordCount.jar
>>
>> In my understanding, YARN (running in AWS) allocates a container for the
>> Jobmanager.
>> The Jobmanager discovers the IP and starts the actor system. At this step
>> the IP it uses is the internal IP address of the EC2 instance.
>>
>> The client, running on my dev machine, is not able to connect to the
>> Jobmanager for reasons explained in my mail below.
>>
>> Is there a way to set the Jobmanager to use the hostname and not
>> the IP address?
>>
>> Or any other suggestions?
>>
>> Thanks,
>> Abhi
>>
>> From: Fabian Hueske <fhueske@gmail.com>
>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>> Date: Wednesday, March 9, 2016 at 12:51 AM
>> To: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: Submit Flink Jobs to YARN running on AWS
>>
>> Hi Abhi,
>>
>> I have used Flink on EMR via YARN a couple of times without problems.
>> I started a Flink YARN session like this:
>>
>> ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096
>>
>> This will start five YARN containers (1 JobManager with 1024MB, 4
>> Taskmanagers with 4096MB). See more config options in the documentation [1].
>> In one of the last lines of the std-out output you should find a line
>> that tells you the IP and port of the JobManager.
>>
>> With the IP and port, you can submit a job as follows:
>>
>> ./bin/flink run -m jmIP:jmPort -p 4 jobJarFile.jar <arguments>
>>
>> This will send the job to the JobManager specified by IP and port and
>> execute the program with a parallelism of 4. See more config options in the
>> documentation [2].
>>
>> If this does not help, could you share the exact command that you use to
>> start the YARN session and submit the job?
>>
>> Best, Fabian
>>
>> [1]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/yarn_setup.html
>> [2]
>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/cli.html
>>
>> 2016-03-08 0:25 GMT+01:00 Bajaj, Abhinav <abhinav.bajaj@here.com>:
>>
>>> Hi,
>>>
>>> I am a newbie to Flink and am trying to use it on AWS.
>>> I have created a YARN cluster on AWS EC2 machines.
>>> I am trying to submit a Flink job to the remote YARN cluster using the
>>> Flink client running on my local machine.
>>>
>>> The Jobmanager starts successfully in the YARN container, but the client
>>> is not able to connect to the Jobmanager.
>>>
>>> Flink Client Logs -
>>>
>>> 13:57:34,877 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
>>> 13:57:35,951 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
>>> 13:57:37,027 INFO  org.apache.flink.yarn.FlinkYarnClient - YARN application has been deployed successfully.
>>> 13:57:37,100 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start actor system.
>>> 13:57:37,532 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start application client.
>>> YARN cluster started
>>> JobManager web interface address
>>> http://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:8088/proxy/application_1456184947990_0003/
>>> Waiting until all TaskManagers have connected
>>> 13:57:37,540 INFO  org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@54.35.41.12:41292/user/jobmanager with session ID null.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>> 13:57:37,543 INFO  org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@54.35.41.12:41292/user/jobmanager with session ID null.
>>> 13:57:37,543 INFO  org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
>>> 13:57:37,545 INFO  org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@54.35.41.12:41292/user/jobmanager.
>>> No status updates from the YARN cluster received so far. Waiting ...
>>>
>>> The logs of the Jobmanager contains the following -
>>>
>>> 21:57:39,142 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@54.35.41.12:41292/]] arriving at [akka.tcp://flink@54.35.41.12:41292] inbound addresses are [akka.tcp://flink@172.31.23.18:41292]
>>> 21:57:40,782 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at ec2-54-35-41-12 (akka.tcp://flink@172.31.23.18:60565/user/taskmanager) as 72101dd2ee94caa7a5ec5a75488359aa. Current number of registered hosts is 1. Current number of alive task slots is 1.
>>> 21:57:41,162 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@54.35.41.12:41292/]] arriving at [akka.tcp://flink@54.35.41.12:41292] inbound addresses are [akka.tcp://flink@172.31.23.18:41292]
>>>
>>> It seems the problem is a mismatch between the address the Jobmanager's
>>> Akka actor system is bound to and the one used by the client.
>>> 172.31.23.18 – is the internal private IP of the EC2 machine where the
>>> Jobmanager container is running.
>>> 54.35.41.12 – is the external IP of the EC2 machine, used by Flink
>>> client to submit the Job.
>>> Because of this mismatch the messages are ignored by the Akka actor
>>> System.
>>>
>>> Can someone please help me with this issue?
>>> I can share the detailed logs, if required.
>>>
>>> Thanks,
>>> Abhi
>>>
>>>
>>
>
