From issues-return-151369-archive-asf-public=cust-asf.ponee.io@flink.apache.org Wed Feb 7 11:56:06 2018 Return-Path: X-Original-To: archive-asf-public@eu.ponee.io Delivered-To: archive-asf-public@eu.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by mx-eu-01.ponee.io (Postfix) with ESMTP id 3E29118065B for ; Wed, 7 Feb 2018 11:56:06 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 29A0C160C5B; Wed, 7 Feb 2018 10:56:06 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 22454160C3C for ; Wed, 7 Feb 2018 11:56:04 +0100 (CET) Received: (qmail 3634 invoked by uid 500); 7 Feb 2018 10:56:04 -0000 Mailing-List: contact issues-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@flink.apache.org Delivered-To: mailing list issues@flink.apache.org Received: (qmail 3623 invoked by uid 99); 7 Feb 2018 10:56:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 07 Feb 2018 10:56:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id C86BDC0040 for ; Wed, 7 Feb 2018 10:56:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -110.311 X-Spam-Level: X-Spam-Status: No, score=-110.311 tagged_above=-999 required=6.31 tests=[ENV_AND_HDR_SPF_MATCH=-0.5, RCVD_IN_DNSWL_MED=-2.3, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01, USER_IN_DEF_SPF_WL=-7.5, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id gqh4GDfc9Nb7 for ; Wed, 7 Feb 2018 10:56:01 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id C60DF5F11F for ; Wed, 7 Feb 2018 10:56:00 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 4A9DEE0047 for ; Wed, 7 Feb 2018 10:56:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 10DD521E85 for ; Wed, 7 Feb 2018 10:56:00 +0000 (UTC) Date: Wed, 7 Feb 2018 10:56:00 +0000 (UTC) From: "Arnaud Linz (JIRA)" To: issues@flink.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (FLINK-8580) No easy way (or issues when trying?) to handle multiple yarn sessions and choose at runtime the one to submit a ha streaming job MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Arnaud Linz created FLINK-8580: ---------------------------------- Summary: No easy way (or issues when trying?) to handle multip= le yarn sessions and choose at runtime the one to submit a ha streaming job Key: FLINK-8580 URL: https://issues.apache.org/jira/browse/FLINK-8580 Project: Flink Issue Type: Improvement Components: Job-Submission Affects Versions: 1.3.2 Reporter: Arnaud Linz Hello, I am using Flink 1.3.2 and I=E2=80=99m struggling to achieve something that= should be simple. For isolation reasons, I want to start multiple long living yarn session co= ntainers (with the same user) and choose at run-time, when I start a HA str= eaming app, which container will hold it. I start my yarn session with the command line option=C2=A0: -Dyarn.properti= es-file.location=3Dmydir The session is created and a .yarn-properties-$USER file is generated. And I=E2=80=99ve tried the following to submit my job: =C2=A0 *CASE 1*=20 *flink-conf.yaml* : yarn.properties-file.location: mydir *flink run options* : none * Uses zookeeper and works=C2=A0 =E2=80=93 but I cannot choose the contain= er as the property file is global. =C2=A0**=C2=A0 *CASE 2* *flink-conf.yaml* : nothing *flink run options* : -yid applicationId * Do not use zookeeper, tries to connect to yarn job manager but fails in = =E2=80=9CJob submission to the JobManager timed out=E2=80=9D error =C2=A0**=C2=A0 *CASE 3* *flink-conf.yaml* : nothing *flink run options* : -yid applicationId and -yD with all dynamic propertie= s found in the =E2=80=9CdynamicPropertiesString=E2=80=9D of .yarn-propertie= s-$USER file * Same as case 2 =C2=A0**=C2=A0 *CASE 4* *flink-conf.yaml* : nothing *flink run options* : -yD yarn.properties-file.location=3Dmydir * Tries to connect to local (non yarn) job manager (and fails) =C2=A0**=C2=A0 *CASE 5* Even weirder: *flink-conf.yaml* : yarn.properties-file.location: mydir *flink run options* : -yD yarn.properties-file.location=3Dmydir * Still tries to connect to local (non yarn) job manager! =C2=A0 Without any other solution, I've made a shell script that copies the origin= al content of FLINK_CONF_DIR in a temporary dir, modify flink-conf.yaml to = set yarn.properties-file.location, and change FLINK_CONF_DIR to that temp d= ir before executing flink to submit the job. I am now able to select the container I want, but I think it should be made= simpler=E2=80=A6 =C2=A0 Logs extracts: *CASE 1:* {{2018:02:01 15:43:20 - Waiting until all TaskManagers have connected}}{{20= 18:02:01 15:43:20 - Starting client actor system.}}{{2018:02:01 15:43:20 - = Starting ZooKeeperLeaderRetrievalService.}}{{2018:02:01 15:43:20 - Trying t= o select the network interface and address to use by connecting to the lead= ing JobManager.}}{{2018:02:01 15:43:20 - TaskManager will try to connect fo= r 10000 milliseconds before falling back to heuristics}}{{2018:02:01 15:43:= 21 - Retrieved new target address elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr= /10.136.170.193:33970.}}{{2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRet= rievalService.}}{{2018:02:01 15:43:21 - Slf4jLogger started}}{{2018:02:01 1= 5:43:21 - Starting remoting}}{{2018:02:01 15:43:21 - Remoting started; list= ening on addresses :[akka.tcp://flink@elara-edge-u2-n01.dev.mlb.jupiter.nby= t.fr:36340]}}{{2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalServi= ce.}}{{2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService.}}{{2= 018:02:01 15:43:21 - TaskManager status (2/1)}}{{2018:02:01 15:43:21 - All = TaskManagers are connected}}{{2018:02:01 15:43:21 - Submitting job with Job= ID: f69197b0b80a76319a87bde10c1e3f77. Waiting for job completion.}}{{2018:0= 2:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService.}}{{2018:02:01 15:= 43:21 - Received SubmitJobAndWait(JobGraph(jobId: f69197b0b80a76319a87bde10= c1e3f77)) but there is no connection to a JobManager yet.}}{{2018:02:01 15:= 43:21 - Received job SND-IMP-SIGNAST (f69197b0b80a76319a87bde10c1e3f77).}}{= {2018:02:01 15:43:21 - Disconnect from JobManager null.}}{{2018:02:01 15:43= :21 - Connect to JobManager Actor[akka.tcp://flink@elara-data-u2-n01.dev.ml= b.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245].}}{{2018:02:01 15:43:2= 1 - Connected to JobManager at Actor[akka.tcp://flink@elara-data-u2-n01.dev= .mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245] with leader session= id 388af5b8-5555-4923-8ee4-8a4b9bfbb0b9.}}{{2018:02:01 15:43:21 - Sending = message to JobManager akka.tcp://flink@elara-data-u2-n01.dev.mlb.jupiter.nb= yt.fr:33970/user/jobmanager to submit job SND-IMP-SIGNAST (f69197b0b80a7631= 9a87bde10c1e3f77) and wait for progress}}{{2018:02:01 15:43:21 - Upload jar= files to job manager akka.tcp://flink@elara-data-u2-n01.dev.mlb.jupiter.nb= yt.fr:33970/user/jobmanager.}}{{2018:02:01 15:43:21 - Blob client connectin= g to akka.tcp://flink@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/= jobmanager}}{{2018:02:01 15:43:22 - Submit job to the job manager akka.tcp:= //flink@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.}}{= {2018:02:01 15:43:22 - Job f69197b0b80a76319a87bde10c1e3f77 was successfull= y submitted to the JobManager akka://flink/deadLetters.}}{{2018:02:01 15:43= :22 - 02/01/2018 15:43:22=C2=A0=C2=A0 Job execution switched to status RUNN= ING.}} =C2=A0 *CASE 2:* {{2018:02:01 15:48:43 - Waiting until all TaskManagers have connected}}{{20= 18:02:01 15:48:43 - Starting client actor system.}}{{2018:02:01 15:48:43 - = Trying to select the network interface and address to use by connecting to = the leading JobManager.}}{{2018:02:01 15:48:43 - TaskManager will try to co= nnect for 10000 milliseconds before falling back to heuristics}}{{2018:02:0= 1 15:48:43 - Retrieved new target address elara-data-u2-n01.dev.mlb.jupiter= .nbyt.fr/10.136.170.193:33970.}}{{2018:02:01 15:48:43 - Slf4jLogger started= }}{{2018:02:01 15:48:43 - Starting remoting}}{{2018:02:01 15:48:43 - Remoti= ng started; listening on addresses :[akka.tcp://flink@elara-edge-u2-n01.dev= .mlb.jupiter.nbyt.fr:34140]}}{{2018:02:01 15:48:43 - TaskManager status (2/= 1)}}{{2018:02:01 15:48:43 - All TaskManagers are connected}}{{2018:02:01 15= :48:43 - Submitting job with JobID: cd3e0e223c57d01d415fe7a6a308576c. Waiti= ng for job completion.}}{{2018:02:01 15:48:43 - Received SubmitJobAndWait(J= obGraph(jobId: cd3e0e223c57d01d415fe7a6a308576c)) but there is no connectio= n to a JobManager yet.}}{{2018:02:01 15:48:43 - Received job SND-IMP-SIGNAS= T (cd3e0e223c57d01d415fe7a6a308576c).}}{{2018:02:01 15:48:43 - Disconnect f= rom JobManager null.}}{{2018:02:01 15:48:43 - Connect to JobManager Actor[a= kka.tcp://flink@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobman= ager#-1554418245].}}{{2018:02:01 15:48:43 - Connected to JobManager at Acto= r[akka.tcp://flink@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/job= manager#-1554418245] with leader session id 00000000-0000-0000-0000-0000000= 00000.}}{{2018:02:01 15:48:43 - Sending message to JobManager akka.tcp://fl= ink@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager to subm= it job SND-IMP-SIGNAST (cd3e0e223c57d01d415fe7a6a308576c) and wait for prog= ress}}{{2018:02:01 15:48:43 - Upload jar files to job manager akka.tcp://fl= ink@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.}}{{201= 8:02:01 15:48:43 - Blob client connecting to akka.tcp://flink@elara-data-u2= -n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager}}{{2018:02:01 15:48:45 -= Submit job to the job manager akka.tcp://flink@elara-data-u2-n01.dev.mlb.j= upiter.nbyt.fr:33970/user/jobmanager.}}{{2018:02:01 15:49:45 - Terminate Jo= bClientActor.}}{{2018:02:01 15:49:45 - Disconnect from JobManager Actor[akk= a.tcp://flink@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanag= er#-1554418245].}} =C2=A0 Then {{Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeou= tException: Job submission to the JobManager timed out. You may increase 'a= kka.client.timeout' in case the JobManager needs more time to configure and= confirm the job submission.}}{{=C2=A0=C2=A0=C2=A0 =C2=A0=C2=A0=C2=A0=C2=A0= at org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMes= sage(JobSubmissionClientActor.java:119)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 at org.apache.flink.runtime.client.JobClientActor.handleMessag= e(JobClientActor.java:251)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at= org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(Flin= kUntypedActor.java:89)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at org= .apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.ja= va:68)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at akka.actor.UntypedA= ctor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)}}{{=C2=A0 =C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0at akka.actor.Actor$class.aroundReceive(Ac= tor.scala:467)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at akka.actor.= UntypedActor.aroundReceive(UntypedActor.scala:97)}}{{=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 at akka.actor.ActorCell.receiveMessage(ActorCell.scal= a:516)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at akka.actor.ActorCel= l.invoke(ActorCell.scala:487)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)}}{{=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at akka.dispatch.Mailbox.run(Mailbox.scala:2= 20)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at akka.dispatch.ForkJoin= ExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)}}{= {=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at scala.concurrent.forkjoin.Fo= rkJoinTask.doExec(ForkJoinTask.java:260)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(Fo= rkJoinPool.java:1339)}}{{=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at scal= a.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)}}{{=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 at scala.concurrent.forkjoin.ForkJo= inWorkerThread.run(ForkJoinWorkerThread.java:107)}} =C2=A0 *CASE 3,4*=20 {{=C2=A0**=C2=A0}}{{2018:02:01 15:35:14 - Starting client actor system.}}{{= 2018:02:01 15:35:14 - Trying to select the network interface and address to= use by connecting to the leading JobManager.}}{{2018:02:01 15:35:14 - Task= Manager will try to connect for 10000 milliseconds before falling back to h= euristics}}{{2018:02:01 15:35:14 - Retrieved new target address localhost/1= 27.0.0.1:6123.}}{{2018:02:01 15:35:15 - Trying to connect to address localh= ost/127.0.0.1:6123}}{{2018:02:01 15:35:15 - Failed to connect from address = 'elara-edge-u2-n01/10.136.170.196': Connexion refus=C3=A9e (Connection refu= sed)}}{{2018:02:01 15:35:15 - Failed to connect from address '/127.0.0.1': = Connexion refus=C3=A9e (Connection refused)}}{{2018:02:01 15:35:15 - Failed= to connect from address '/192.168.117.1': Connexion refus=C3=A9e (Connecti= on refused)}}{{2018:02:01 15:35:15 - Failed to connect from address '/10.13= 6.170.225': Connexion refus=C3=A9e (Connection refused)}}{{2018:02:01 15:35= :15 - Failed to connect from address '/fe80:0:0:0:20c:29ff:fe8f:3fdd%ens192= ': Le r=C3=A9seau n'est pas accessible (connect failed)}}{{2018:02:01 15:35= :15 - Failed to connect from address '/10.136.170.196': Connexion refus=C3= =A9e (Connection refused)}}{{2018:02:01 15:35:15 - Failed to connect from a= ddress '/127.0.0.1': Connexion refus=C3=A9e (Connection refused)}}{{2018:02= :01 15:35:15 - Failed to connect from address '/192.168.117.1': Connexion r= efus=C3=A9e (Connection refused)}}{{2018:02:01 15:35:15 - Failed to connect= from address '/10.136.170.225': Connexion refus=C3=A9e (Connection refused= )}}{{2018:02:01 15:35:15 - Failed to connect from address '/fe80:0:0:0:20c:= 29ff:fe8f:3fdd%ens192': Le r=C3=A9seau n'est pas accessible (connect failed= )}}{{2018:02:01 15:35:15 - Failed to connect from address '/10.136.170.196'= : Connexion refus=C3=A9e (Connection refused)}}{{2018:02:01 15:35:15 - Fail= ed to connect from address '/127.0.0.1': Connexion refus=C3=A9e (Connection= refused)}} =C2=A0**=C2=A0 =C2=A0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)