From: jpreisner@free.fr
To: Piotr Nowojski
Cc: user@flink.apache.org
Date: Tue, 9 Oct 2018 18:09:20 +0200 (CEST)
Message-ID: <800672508.350438702.1539101360396.JavaMail.root@zimbra50-e8.priv.proxad.net>
Subject: Re: JobManager did not respond within 60000 ms

Hi Piotrek,

Thank you for your answer.
Actually, it was necessary to increase the memory of the JobManager (I had tested this, but I had not restarted Flink ...).

I will also work on optimization. I thought it was good practice to create as many functions as possible, based on their functional meaning (for example, two FilterFunctions that each have a different functional meaning). So I will try to have fewer functions (for example, by merging my two FilterFunctions into one).

Thanks again, Piotrek!

Julien.

----- Original message -----
From: "Piotr Nowojski"
To: jpreisner@free.fr
Cc: user@flink.apache.org
Sent: Tuesday, 9 October 2018 10:37:58
Subject: Re: JobManager did not respond within 60000 ms

Hi,

You have quite a complicated job graph and very low memory settings for the job manager and task manager. It might be that long GC pauses are causing this problem.

Secondly, a Google search for this error returns quite a few results that point toward high-availability issues. Have you read those previously reported problems?

Thanks, Piotrek

On 9 Oct 2018, at 09:57, jpreisner@free.fr wrote:

I have a streaming job that runs in a standalone cluster. The Flink version is 1.4.1. Everything was working so far, but since I added new processing steps, I cannot start my job anymore.
I have this exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms
    at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
    at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
    at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
    at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 ms
    at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
    at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
    ... 11 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
    ... 12 more

I see very strange behavior when I comment out a function (any one, for example a FilterFunction, whether it was present before or after my modification).

I tried changing the configuration (akka.client.timeout and akka.framesize), without success.

This is my flink-conf.yaml:

jobmanager.rpc.address: myhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 128
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 100
taskmanager.memory.preallocate: false
taskmanager.data.port: 6121
parallelism.default: 1
taskmanager.tmp.dirs: /dohdev/flink/tmp/tskmgr
blob.storage.directory: /dohdev/flink/tmp/blob
jobmanager.web.port: -1
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.root: /dohdev/flink
high-availability.cluster-id: dev
high-availability.storageDir: file:////mnt/metaflink
high-availability.zookeeper.storageDir: /mnt/metaflink/inh/agregateur/recovery
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 1000
restart-strategy.fixed-delay.delay: 5 s
zookeeper.sasl.disable: true
blob.service.cleanup.interval: 60

And I launch the job with this command:

bin/flink run -d myjar.jar

I have attached a graph of my job when it works (Graph.PNG).

Do you have any idea what the problem is?

Thanks.
Julien
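
For reference, the fix reported at the top of this thread (raising the JobManager heap) might look roughly like the fragment below. This is only a sketch, assuming the same Flink 1.4.x key names as in the flink-conf.yaml quoted above. The value 1024 is a hypothetical placeholder: the thread only establishes that 128 was too low for this job graph, not which value was finally used.

```yaml
# flink-conf.yaml (sketch; same key names as the configuration quoted in this thread)

# Was 128 in the original configuration. 1024 is an assumed, illustrative value,
# not the one actually chosen by the author.
jobmanager.heap.mb: 1024

# Left unchanged from the original configuration.
taskmanager.heap.mb: 1024
```

As noted in the reply, the new value had no effect until Flink was restarted, so the cluster presumably needs a restart before the job is resubmitted with bin/flink run -d myjar.jar.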