Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 096A5200C25 for ; Fri, 10 Feb 2017 00:35:12 +0100 (CET) Received: by cust-asf.ponee.io (Postfix) id 07EF4160B64; Thu, 9 Feb 2017 23:35:12 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 2B5C2160B50 for ; Fri, 10 Feb 2017 00:35:11 +0100 (CET) Received: (qmail 1114 invoked by uid 500); 9 Feb 2017 23:35:10 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@flink.apache.org Delivered-To: mailing list user@flink.apache.org Received: (qmail 1104 invoked by uid 99); 9 Feb 2017 23:35:10 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 09 Feb 2017 23:35:10 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id B7CBB18285E for ; Thu, 9 Feb 2017 23:35:09 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.88 X-Spam-Level: * X-Spam-Status: No, score=1.88 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd3-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id 5GNTCSoQopV5 for ; Thu, 9 Feb 2017 23:35:07 +0000 (UTC) Received: from mail-wm0-f51.google.com (mail-wm0-f51.google.com [74.125.82.51]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id E7D185F470 for ; Thu, 9 Feb 2017 23:35:06 +0000 (UTC) Received: by mail-wm0-f51.google.com with SMTP id v186so95902035wmd.0 for ; Thu, 09 Feb 2017 15:35:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=nLWzRTFgOWWU86t8bePBQkAelKi1/4fJj8bvmiZVkwg=; b=RUW4P7YqWPQC5XpMJIm77DSlaxSIARMNSIUnFpE4SzRMFa4tIWQpLDv3lmVpQMDZFc YuSc7EelW7FnL+xevXIyNeeGE8/+WyopGKcOv7J2KXi9oX3YvldhuuFIexbKjRaESZGF dVqezNktRH/Xk2+r75EQxmuUtofNQSOsMCTUc2vSi7mU4DgJe1sqAGOudgk4nDEW5cuc J1LpFF070bvXGIHl6QLoa2oiO16Hkpby0T2j0pW/BgJNzOV59VATKaPHnUpiX+dZDvAo EczyIiZ8cUZsRn5+1g7GiubePHzH8FCcFC+6XWMKrcoKmItjvBQn78Q82JhiiaO/4EE5 j8lQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=nLWzRTFgOWWU86t8bePBQkAelKi1/4fJj8bvmiZVkwg=; b=bREXbZoZzQbZGJgCPILQms1Gpne8z9hTxIsz871U96ROnB/2KAWfaSneOwfVf0P3fG h9NBm+Ki5za+OkBt5G6hsDl1bNwwWqMAZcRc3G33AhjMLeULCkX+byLNpIn1Mk6dtgC9 DeVyAwsKUbwyxl3JupiT0n9GQ998hE4IZdmsmQ24+T9yLo3n0Wu2U0KOWr1OsxVigjY1 GbTxeHPLLWu9et7OX3le1c3SrY0sF3Q3raOHOilZnBuAORcm9f6F1cS8kvD+R8hRRm8u Pp+XpnLVqKz2U8/z4tE6/WSc+SpMHLrwIP5dijXdLpeX4F9M8Y2pUNAmPYipeMtEGG6t Adtg== X-Gm-Message-State: AMke39nJ0XMYiNHfC/O2kkKQkNeNoSDPwqxtxtub4E3UmzaRM6OruVYGNLIio2dGeqdrQowplOwOwCEskexmGg== X-Received: by 10.28.191.79 with SMTP id p76mr23763511wmf.21.1486683302052; Thu, 09 Feb 2017 15:35:02 -0800 (PST) MIME-Version: 1.0 From: Geoffrey Mon Date: Thu, 09 Feb 2017 23:34:51 +0000 Message-ID: Subject: "Job submission to the JobManager timed out" on EMR YARN cluster with multiple jobs To: "user@flink.apache.org" Content-Type: multipart/alternative; boundary=94eb2c0712721ccb490548216e4a archived-at: Thu, 09 Feb 2017 23:35:12 -0000 --94eb2c0712721ccb490548216e4a Content-Type: text/plain; charset=UTF-8 Hello all, I'm running a Flink plan made up of multiple jobs. The source for my job can be found here if it would help in any way: https://github.com/quinngroup/flink-r1dl/blob/master/src/main/java/com/github/quinngroup/R1DL.java Each of the jobs (except for the first job) depends on files generated by the previous job; I'm running it on an AWS EMR cluster using YARN. When I submit the plan file, the first job runs as planned. After it completes, the second job is submitted by the YARN client: 02/09/2017 16:39:43 DataSink (CsvSink)(4/5) switched to FINISHED 02/09/2017 16:39:43 Job execution switched to status FINISHED. 2017-02-09 16:40:26,470 INFO org.apache.flink.yarn.YarnClusterClient - Waiting until all TaskManagers have connected Waiting until all TaskManagers have connected 2017-02-09 16:40:26,476 INFO org.apache.flink.yarn.YarnClusterClient - TaskManager status (5/5) TaskManager status (5/5) 2017-02-09 16:40:26,476 INFO org.apache.flink.yarn.YarnClusterClient - All TaskManagers are connected All TaskManagers are connected 2017-02-09 16:40:26,480 INFO org.apache.flink.yarn.YarnClusterClient - Submitting job with JobID: b226f5f18a78bc386bd1b1b6d30515ea. Waiting for job completion. Submitting job with JobID: b226f5f18a78bc386bd1b1b6d30515ea. Waiting for job completion. Connected to JobManager at Actor[akka.tcp://flink@ .ec2.internal:35598/user/jobmanager#68430682] If the input file is small and the first job runs quickly (~1 minute works for me), then the second job runs fine. However, if the input file for my first job is large and the first job takes more than a minute or so to complete, Flink will not acknowledge receiving the next job; the web Flink console does not show any new jobs and Flink logs do not mention receiving any new jobs after the first job has completed. The YARN client's job submission times out after Flink does not respond: Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out. You may increase 'akka.client.timeout' in case the JobManager needs more time to configure and confirm the job submission. at org.apache.flink.runtime.client.JobSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:119) at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:239) at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:88) at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68) at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167) I have tried increasing akka.client.timeout to large values such as 1200s (20 minutes), but even then Flink does not acknowledge or execute any other jobs and there is the same timeout error. Does anyone know how I can get Flink to execute all of the jobs properly? Cheers, Geoffrey Mon --94eb2c0712721ccb490548216e4a Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
Hello all,

I'm running a Flink plan= made up of multiple jobs. The source for my job can be found here if it wo= uld help in any way:=C2=A0https://githu= b.com/quinngroup/flink-r1dl/blob/master/src/main/java/com/github/quinngroup= /R1DL.java
Each of the jobs (except for the first job) depend= s on files generated by the previous job; I'm running it on an AWS EMR = cluster using YARN.

When I submit the plan file, t= he first job runs as planned. After it completes, the second job is submitt= ed by the YARN client:

<snip>
02/09/2017 16:39:43 DataSink (CsvSink)(4/5) switched to FINISHED=C2=A0
02/0= 9/2017 16:39:43 Job execution switched to status FINISHED.
2017-02-09 16:40:= 26,470 INFO =C2=A0org.apache.flink.yarn.YarnClusterClient =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 - Waiting until= all TaskManagers have connected
Waiting until all TaskManagers h= ave connected
2017-02-09 16:40:26,476 INFO =C2=A0org.apache.flink= .yarn.YarnClusterClient =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 - TaskManager status (5/5)
TaskManage= r status (5/5)
2017-02-09 16:40:26,476 INFO =C2=A0org.apache.flin= k.yarn.YarnClusterClient =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 - All TaskManagers are connected
All = TaskManagers are connected
2017-02-09 16:40:26,480 INFO =C2=A0org= .apache.flink.yarn.YarnClusterClient =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 - Submitting job with JobID: b226f5f= 18a78bc386bd1b1b6d30515ea. Waiting for job completion.
Submitting= job with JobID: b226f5f18a78bc386bd1b1b6d30515ea. Waiting for job completi= on.
Connected to JobManager at Actor[akka.tcp://flink@<snip>= ;.ec2.internal:35598/user/jobmanager#68430682]

If the input file is small and the first job runs quickly (~1 minute wor= ks for me), then the second job runs fine. However, if the input file for m= y first job is large and the first job takes more than a minute or so to co= mplete, Flink will not acknowledge receiving the next job; the web Flink co= nsole does not show any new jobs and Flink logs do not mention receiving an= y new jobs after the first job has completed. The YARN client's job sub= mission times out after Flink does not respond:

Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeo= utException: Job submission to the JobManager timed out. You may increase &= #39;akka.client.timeout' in case the JobManager needs more time to conf= igure and confirm the job submission.
at org.apache.flink.runtime.client.Jo= bSubmissionClientActor.handleCustomMessage(JobSubmissionClientActor.java:11= 9)
at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClien= tActor.java:239)
at org.apache.flink.runtime.akka.FlinkUntypedActor.handleL= eaderSessionID(FlinkUntypedActor.java:88)
at org.apache.flink.runtime.akka.= FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68)
at akka.actor.Untype= dActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)

I have tried increasing akka.client.timeout to large valu= es such as 1200s (20 minutes), but even then Flink does not acknowledge or = execute any other jobs and there is the same timeout error. Does anyone kno= w how I can get Flink to execute all of the jobs properly?

Cheers,
Geoffrey Mon
--94eb2c0712721ccb490548216e4a--