From: Yang Wang
Date: Tue, 23 Jun 2020 11:07:48 +0800
Subject: Re: Submitted Flink Jobs EMR are failing (Could not start rest endpoint on any port in port range 8081)
To: Arvid Heise
Cc: "sk_acura@yahoo.com", "user@flink.apache.org"

Hi Sateesh,

if "rest.port" or "rest.bind-port" is configured explicitly, that value
will be used to start the rest server. So you need to remove them from
the flink-conf.yaml, or configure them to "0" or to a port range
(50100-50200).

By default, "flink run" will always start a dedicated Flink cluster for
each job. If you want to use session mode, you need to start a session
with "yarn-session.sh" first, and then use
"flink run ... -yid application_id" to submit a Flink job to the
existing cluster.

Best,
Yang

Arvid Heise <arvid@ververica.com> wrote on Mon, Jun 22, 2020 at 9:58 PM:

> Hi Sateesh,
>
> the solution still applies; not all entries are listed in the conf
> template.
>
> From what you have written, it's almost certain that the first jobs
> have not finished (hence the port is taken). Make sure you don't use
> the detached mode when submitting.
> You can see the status of the jobs in the YARN resource manager, which
> also links to the respective Flink JobManagers.
>
> And yes, by default, each job creates a new YARN session unless you use
> sessions explicitly [1].
>
> If you need more help, please post your steps.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/ops/deployment/yarn_setup.html#flink-yarn-session
>
> On Thu, Jun 18, 2020 at 4:15 PM sk_acura@yahoo.com <sk_acura@yahoo.com>
> wrote:
>
>> I am using EMR 5.30.0 and trying to submit a Flink (1.10.0) job using
>> the following command
>>
>> flink run -m yarn-cluster /home/hadoop/flink--test-0.0.1-SNAPSHOT.jar
>>
>> and I am getting the following error:
>>
>>     Caused by: org.apache.flink.yarn.YarnClusterDescriptor$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
>>
>> After going through the logs on the worker nodes and the job manager
>> logs, it looks like there is a port conflict:
>>
>>     2020-06-17 21:40:51,199 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint         - Could not start cluster entrypoint YarnJobClusterEntrypoint.
>>     org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnJobClusterEntrypoint.
>>             at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:187)
>>             at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:518)
>>             at org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint.main(YarnJobClusterEntrypoint.java:119)
>>     Caused by: org.apache.flink.util.FlinkException: Could not create the DispatcherResourceManagerComponent.
>>             at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:261)
>>             at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)
>>             at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169)
>>             at java.security.AccessController.doPrivileged(Native Method)
>>             at javax.security.auth.Subject.doAs(Subject.java:422)
>>             at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>>             at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>             at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168)
>>             ... 2 more
>>     Caused by: java.net.BindException: Could not start rest endpoint on any port in port range 8081
>>             at org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:219)
>>             at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:165)
>>             ... 9 more
>>
>> There seems to be a JIRA ticket
>> (https://issues.apache.org/jira/browse/FLINK-15394) open for this
>> (though it is for the 1.9 version of Flink), and the suggested solution
>> is to use a port range for **rest.bind-port** in the Flink config file.
>>
>> However, in the 1.10 version of Flink we only have the following in the
>> YARN conf YML file:
>>
>>     rest.port: 8081
>>
>> Another issue I am facing is that I have submitted multiple Flink jobs
>> (the same job multiple times) using the AWS Console and via the Add Step
>> UI. Only one of the jobs succeeded and the rest failed with the error
>> posted above. And when I go to the Flink UI, it doesn't show any jobs
>> at all.
>>
>> I am wondering whether each of the submitted jobs is trying to create a
>> Flink YARN session instead of using the existing one.
>>
>> Thanks
>> Sateesh
>
> --
> Arvid Heise | Senior Java Developer
>
> Follow us @VervericaData
> --
> Join Flink Forward - The Apache Flink Conference
> Stream Processing | Event Driven | Real Time
> --
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng
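
For reference, the port-range setting suggested in Yang's reply at the top of this message might look like the following in flink-conf.yaml. This is only a sketch: the range 50100-50200 is the example given in the reply, and whether those ports are actually free and reachable on the EMR nodes is an assumption that needs checking on the cluster.

```yaml
# flink-conf.yaml (sketch, not a verified EMR configuration)

# The fixed port from the original report is commented out so that several
# per-job clusters landing on the same node do not all try to bind 8081.
# rest.port: 8081

# Let the rest server bind to any free port in a range instead.
# 50100-50200 is the example range from the reply; "0" is the other
# value mentioned there.
rest.bind-port: 50100-50200
```

With a range configured, the rest endpoint port differs per job, so the JobManager link shown in the YARN resource manager is the place to find each job's UI.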