flink-user mailing list archives

From Josh <jof...@gmail.com>
Subject Re: Submit Flink Jobs to YARN running on AWS
Date Mon, 06 Jun 2016 18:55:41 GMT
Hi Abhi,

I'm also looking to deploy Flink jobs remotely to YARN, and eventually
automate it - just wondering if you found a way to do it?

Thanks,
Josh

On Wed, May 25, 2016 at 12:36 AM, Bajaj, Abhinav <abhinav.bajaj@here.com>
wrote:

> Hi,
>
> Has anyone tried to submit a Flink job remotely to YARN running in AWS?
> The case I am stuck with is where the Flink client is on my laptop and
> YARN is running on AWS.
>
> @Robert, Did you get a chance to try this out?
>
> Regards,
> Abhi
>
> From: "Bajaj, Abhinav" <abhinav.bajaj@here.com>
> Date: Friday, April 29, 2016 at 3:50 PM
>
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Submit Flink Jobs to YARN running on AWS
>
> Hi Robert,
>
> Thanks for your reply.
>
> I am using the Public DNS for the EC2 machines in the yarn and hdfs
> configuration files. It looks like
> "ec2-203-0-113-25.compute-1.amazonaws.com".
> You should be able to connect then.
>
> I have hadoop installed locally and the YARN_CONF_DIR is pointing to it.
> The yarn-site.xml and core-site.xml files use the resource manager
> address(Public DNS) running in AWS.
>
> So, whenever I submit the job using the client on my laptop, it connects
> to RM.
> The RM starts the YARN application and starts the Job manager.
> The job manager starts the actor system using the internal IP of the
> nodemanager. In my understanding, this is where the problem lies.
>
> The local client tries to connect to the Job manager actor system, but the
> actor system drops the messages because the IP address it started with
> (the EC2 internal IP) does not match the external (public) IP address that
> the Flink client used to send the message.
> Please see my first mail below for detailed logs.
>
> Please keep me posted with your progress.
>
> I plan to move the cluster to a VPC for other reasons.
> I have limited knowledge of VPC, but I guess the move will not resolve the
> difference between the internal and external IP addresses.
> Please correct me if your view is different.
>
> It will be great if you are able to reproduce the issue.
>
> Thanks again.
> Abhi
>
>
>
> *Abhinav Bajaj*
>
> Senior Engineer
>
> HERE Predictive Analytics
>
> Office:  +12062092767
>
> Mobile: +17083299516
>
> *HERE Seattle*
>
> 701 Pike Street, #2000, Seattle, WA 98101, USA
>
> *47° 36' 41" N. 122° 19' 57" W*
>
> *HERE Maps*
>
>
>
>
> From: Robert Metzger <rmetzger@apache.org>
> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
> Date: Tuesday, April 26, 2016 at 3:16 AM
> To: "user@flink.apache.org" <user@flink.apache.org>
> Subject: Re: Submit Flink Jobs to YARN running on AWS
>
> I've started my own EMR cluster and tried to launch a Flink job on it from
> my local machine.
> I have to admit that configuring the EMR-launched Hadoop for external
> access is quite a hassle.
>
> I'm not even able to submit Flink to the YARN cluster because the client
> cannot connect to the ResourceManager. I've changed the resource manager
> hostname to the public one in the yarn-site.xml on the cluster and
> restarted it, but the client still cannot connect.
> It seems that the RM address is being overwritten by the Hadoop code?
> [image: Inline image 1]
>
> How did you manage to get this working?
>
> In the VM settings, I disabled the "Source/Dest checks", but I don't think
> this is related.
>
> Have you considered using Amazon's VPN service? I guess you would then
> have "local" access to the cluster.
>
> On YARN, Flink is not using the flink-conf.yaml setting for the
> jobmanager's hostname. It's using YARN's "yarn.nodemanager.hostname" from
> the yarn-site.xml.
> I haven't tried it, but it could work if you set the public hostname of
> each NodeManager in the yarn-site.xml.
>
> Also, maybe the product forum / customer support of Amazon can help you
> here. Other systems like Spark or Storm have very similar architectures and
> will face the same issues. I guess they have some recipes for such
> situations.
>
> Regards,
> Robert
>
>
>
>
> On Tue, Apr 26, 2016 at 10:47 AM, Robert Metzger <rmetzger@apache.org>
> wrote:
>
>> Hi Abhi,
>>
>> I'll try to reproduce the issue and come up with a solution.
>>
>> On Tue, Apr 26, 2016 at 1:13 AM, Bajaj, Abhinav <abhinav.bajaj@here.com>
>> wrote:
>>
>>> Hi Fabian,
>>>
>>> Thanks for your reply and the pointers to documentation.
>>>
>>> In these steps, I think the Flink client is installed on the master
>>> node, referring to steps mentioned in Flink docs here
>>> <https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/aws.html>
>>> .
>>> However, the scenario I have is to run the client on my local machine
>>> and submit jobs remotely to the YARN Cluster (running on EMR or
>>> independently).
>>>
>>> Let me describe in more detail here.
>>> I am trying to submit a single Flink Job to YARN using the client,
>>> running on my dev machine -
>>>
>>> ./bin/flink run -m yarn-cluster -yn 4 -yjm 1024 -ytm 4096
>>>  ./examples/batch/WordCount.jar
>>>
>>> In my understanding, YARN (running in AWS) allocates a container for the
>>> Jobmanager.
>>> Jobmanager discovers the IP and starts the Actor system. At this step
>>> the IP it uses is the internal IP address of the EC2 instance.
>>>
>>> The client, running on my dev machine, is not able to connect to the
>>> Jobmanager for reasons explained in my mail below.
>>>
>>> Is there a way to make the Jobmanager use the hostname instead of
>>> the IP address?
>>>
>>> Or any other suggestions?
>>>
>>> Thanks,
>>> Abhi
>>>
>>> From: Fabian Hueske <fhueske@gmail.com>
>>> Reply-To: "user@flink.apache.org" <user@flink.apache.org>
>>> Date: Wednesday, March 9, 2016 at 12:51 AM
>>> To: "user@flink.apache.org" <user@flink.apache.org>
>>> Subject: Re: Submit Flink Jobs to YARN running on AWS
>>>
>>> Hi Abhi,
>>>
>>> I have used Flink on EMR via YARN a couple of times without problems.
>>> I started a Flink YARN session like this:
>>>
>>> ./bin/yarn-session.sh -n 4 -jm 1024 -tm 4096
>>>
>>> This will start five YARN containers (1 JobManager with 1024 MB and 4
>>> TaskManagers with 4096 MB each). See more config options in the documentation [1].
>>> In one of the last lines of the std-out output you should find a line
>>> that tells you the IP and port of the JobManager.
>>>
>>> With the IP and port, you can submit a job as follows:
>>>
>>> ./bin/flink run -m jmIP:jmPort -p 4 jobJarFile.jar <arguments>
>>>
>>> This will send the job to the JobManager specified by IP and port and
>>> execute the program with a parallelism of 4. See more config options in the
>>> documentation [2].
>>>
>>> If this does not help, could you share the exact command that you use to
>>> start the YARN session and submit the job?
>>>
>>> Best, Fabian
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/setup/yarn_setup.html
>>> [2]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/cli.html
>>>
>>> 2016-03-08 0:25 GMT+01:00 Bajaj, Abhinav <abhinav.bajaj@here.com>:
>>>
>>>> Hi,
>>>>
>>>> I am a newbie to Flink and am trying to use it in AWS.
>>>> I have created a YARN cluster on AWS EC2 machines.
>>>> I am trying to submit a Flink job to the remote YARN cluster using the
>>>> Flink client running on my local machine.
>>>>
>>>> The Jobmanager starts successfully in the YARN container, but the client
>>>> is not able to connect to the Jobmanager.
>>>>
>>>> Flink Client Logs -
>>>>
>>>> 13:57:34,877 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
>>>> 13:57:35,951 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
>>>> 13:57:37,027 INFO  org.apache.flink.yarn.FlinkYarnClient - YARN application has been deployed successfully.
>>>> 13:57:37,100 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start actor system.
>>>> 13:57:37,532 INFO  org.apache.flink.yarn.FlinkYarnCluster - Start application client.
>>>> YARN cluster started
>>>> JobManager web interface address http://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:8088/proxy/application_1456184947990_0003/
>>>> Waiting until all TaskManagers have connected
>>>> 13:57:37,540 INFO  org.apache.flink.yarn.ApplicationClient - Notification about new leader address akka.tcp://flink@54.35.41.12:41292/user/jobmanager with session ID null.
>>>> No status updates from the YARN cluster received so far. Waiting ...
>>>> 13:57:37,543 INFO  org.apache.flink.yarn.ApplicationClient - Received address of new leader akka.tcp://flink@54.35.41.12:41292/user/jobmanager with session ID null.
>>>> 13:57:37,543 INFO  org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager null.
>>>> 13:57:37,545 INFO  org.apache.flink.yarn.ApplicationClient - Trying to register at JobManager akka.tcp://flink@54.35.41.12:41292/user/jobmanager.
>>>> No status updates from the YARN cluster received so far. Waiting ...
>>>>
>>>> The logs of the Jobmanager contains the following -
>>>>
>>>> 21:57:39,142 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@54.35.41.12:41292/]] arriving at [akka.tcp://flink@54.35.41.12:41292] inbound addresses are [akka.tcp://flink@172.31.23.18:41292]
>>>> 21:57:40,782 INFO  org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at ec2-54-35-41-12 (akka.tcp://flink@172.31.23.18:60565/user/taskmanager) as 72101dd2ee94caa7a5ec5a75488359aa. Current number of registered hosts is 1. Current number of alive task slots is 1.
>>>> 21:57:41,162 ERROR akka.remote.EndpointWriter - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient [Actor[akka.tcp://flink@54.35.41.12:41292/]] arriving at [akka.tcp://flink@54.35.41.12:41292] inbound addresses are [akka.tcp://flink@172.31.23.18:41292]
>>>>
>>>> It seems the problem is a mismatch between the address the Jobmanager's
>>>> Akka actor system is running on and the one used by the client.
>>>> 172.31.23.18 – is the internal private IP of the EC2 machine where the
>>>> Jobmanager container is running.
>>>> 54.35.41.12 – is the external IP of the EC2 machine, used by the Flink
>>>> client to submit the job.
>>>> Because of this mismatch, the messages are ignored by the Akka actor
>>>> system.
>>>>
>>>> Can someone please help me with this issue?
>>>> I can share the detailed logs if required.
>>>>
>>>> Thanks,
>>>> Abhi
>>>>
>>>>
>>>
>>
>
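
The address mismatch reported in the quoted JobManager log can be checked mechanically. The following is only a minimal Python sketch and is not part of Flink or Akka: the function name find_address_mismatch and the SAMPLE string are illustrative assumptions, and the parsing relies solely on the wording of the "dropping message" entry shown in the thread. It compares the address the message arrived at with the inbound addresses the actor system reports.

```python
import re


def find_address_mismatch(log_entry):
    """Return (contacted, inbound_addresses) if an Akka 'dropping message'
    entry shows the client used an address the actor system does not
    listen on; return None otherwise.

    Hypothetical helper: it only understands the log wording quoted in
    this thread.
    """
    # Mail clients wrap long log lines, so collapse all whitespace first.
    text = " ".join(log_entry.split())
    if "dropping message" not in text:
        return None

    arriving = re.search(r"arriving at \[([^\]]+)\]", text)
    inbound = re.search(r"inbound addresses are \[([^\]]+)\]", text)
    if arriving is None or inbound is None:
        return None

    contacted = arriving.group(1).strip()
    # Split on commas in case more than one inbound address is listed.
    inbound_addresses = [a.strip() for a in inbound.group(1).split(",")]

    if contacted in inbound_addresses:
        return None
    return contacted, inbound_addresses


# Log entry copied from the original message (wrapped as in the mail).
SAMPLE = """21:57:39,142 ERROR akka.remote.EndpointWriter
 - dropping message [class akka.actor.ActorSelectionMessage] for non-local recipient
[Actor[akka.tcp://flink@54.35.41.12:41292/]] arriving at [akka.tcp://flink@54.35.41.12:41292]
inbound addresses are [akka.tcp://flink@172.31.23.18:41292]"""

if __name__ == "__main__":
    result = find_address_mismatch(SAMPLE)
    if result is not None:
        contacted, inbound_addresses = result
        print("client contacted:", contacted)
        print("actor system listens on:", ", ".join(inbound_addresses))
```

Running the helper over a JobManager log only confirms the symptom (public address contacted, private address bound); it does not change how the actor system picks its address.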
