Date: Tue, 9 Oct 2018 18:09:20 +0200 (CEST)
From: jpreisner@free.fr
To: Piotr Nowojski <piotr@data-artisans.com>
Cc: user@flink.apache.org
Message-ID: <800672508.350438702.1539101360396.JavaMail.root@zimbra50-e8.priv.proxad.net>
In-Reply-To: <E7ABE991-4F1B-40CA-BD92-34634F80BA84@data-artisans.com>
Subject: Re: JobManager did not respond within 60000 ms

Hi Piotrek,

Thank you for your answer. It turned out that I did need to increase the JobManager's memory (I had already tried that, but I had not restarted Flink afterwards ...).

I will also work on optimization. I thought it was good practice to create as many functions as possible, each based on its functional value (for example: two FilterFunctions that have different functional meanings). So I will try to have fewer functions (for example: merge my two FilterFunctions into one).
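
As a rough illustration of that merge, here is a minimal sketch. It assumes the shape of Flink's FilterFunction<T> interface (a single method, boolean filter(T value) throws Exception) and declares a local stand-in for it so the snippet compiles without Flink on the classpath. The Event class, its fields, and both conditions are hypothetical; the real job's types are not shown in this thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local stand-in (assumption) with the same shape as
// org.apache.flink.api.common.functions.FilterFunction.
interface FilterFunction<T> {
    boolean filter(T value) throws Exception;
}

// Hypothetical record type, for illustration only.
class Event {
    final String type;
    final long amount;

    Event(String type, long amount) {
        this.type = type;
        this.amount = amount;
    }
}

// One function carrying both checks, instead of two separate filter functions.
class OrderAndAmountFilter implements FilterFunction<Event> {
    @Override
    public boolean filter(Event e) {
        boolean isOrder = "ORDER".equals(e.type); // formerly the first filter
        boolean isPositive = e.amount > 0;        // formerly the second filter
        return isOrder && isPositive;
    }
}

class CombinedFilterExample {
    // Applies the filter to each element, as a filter operator would per record.
    static List<Event> apply(FilterFunction<Event> f, List<Event> in) {
        List<Event> out = new ArrayList<>();
        for (Event e : in) {
            try {
                if (f.filter(e)) {
                    out.add(e);
                }
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("ORDER", 10), new Event("ORDER", 0), new Event("REFUND", 5));
        System.out.println(apply(new OrderAndAmountFilter(), events).size());
    }
}
```

With Flink on the classpath, the stand-in interface would be dropped and the job would call filter(new OrderAndAmountFilter()) once on the stream instead of calling filter(...) twice, which removes one operator from the job graph. Whether that makes a measurable difference depends on the job.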

Thanks again, Piotrek!

Julien.

----- Original message -----
From: "Piotr Nowojski" <piotr@data-artisans.com>
To: jpreisner@free.fr
Cc: user@flink.apache.org
Sent: Tuesday, 9 October 2018 10:37:58
Subject: Re: JobManager did not respond within 60000 ms

Hi,

You have quite a complicated job graph and very low memory settings for the job manager and task manager. It might be that long GC pauses are causing this problem.
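
One way to check the GC hypothesis is to compare the time a JVM has spent in garbage collection with its uptime. The following is a minimal sketch using the standard java.lang.management API. It reports on the JVM it runs in, so it only illustrates which numbers to look at; for a running JobManager the same values would have to be read over JMX or from GC logs, which is an assumption about the setup. The class name GcPauseCheck is hypothetical.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcPauseCheck {
    // Sums the accumulated collection time (ms) across all collectors in this JVM.
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Maximum heap this JVM will attempt to use, in MiB (reflects -Xmx).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.println("max heap (MiB): " + maxHeapMiB());
        System.out.println("GC time (ms): " + totalGcTimeMillis()
                + " of " + uptimeMs + " ms uptime");
    }
}
```

A GC time that is a large fraction of uptime, together with a small maximum heap, would be consistent with the explanation above.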


Secondly, a Google search for this error turns up quite a few results that point toward high-availability issues. Have you read those previously reported problems?


Thanks, Piotrek

On 9 Oct 2018, at 09:57, jpreisner@free.fr wrote:


I have a streaming job that runs on a standalone cluster. The Flink version is 1.4.1. Everything was working until now, but since I added new processing steps, I can no longer start my job. I get this exception:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: JobManager did not respond within 60000 ms
    at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:524)
    at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:103)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
    at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:402)
    at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
    at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
    at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
    at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
    at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
    at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobTimeoutException: JobManager did not respond within 60000 ms
    at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:437)
    at org.apache.flink.client.program.ClusterClient.runDetached(ClusterClient.java:516)
    ... 11 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
    at org.apache.flink.runtime.client.JobClient.submitJobDetached(JobClient.java:435)
    ... 12 more

I see very strange behavior when I comment out a function (any one, for example a FilterFunction, whether it was present before my modification or added by it).
I tried changing the configuration (akka.client.timeout and akka.framesize) without success.

This is my flink-conf.yaml:
jobmanager.rpc.address: myhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 128
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 100
taskmanager.memory.preallocate: false
taskmanager.data.port: 6121
parallelism.default: 1
taskmanager.tmp.dirs: /dohdev/flink/tmp/tskmgr
blob.storage.directory: /dohdev/flink/tmp/blob
jobmanager.web.port: -1
high-availability: zookeeper
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.root: /dohdev/flink
high-availability.cluster-id: dev
high-availability.storageDir: file:////mnt/metaflink
high-availability.zookeeper.storageDir: /mnt/metaflink/inh/agregateur/recovery
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 1000
restart-strategy.fixed-delay.delay: 5 s
zookeeper.sasl.disable: true
blob.service.cleanup.interval: 60

I launch the job with this command: bin/flink run -d myjar.jar

I have attached a graph of my job when it works (Graph.PNG).

Do you have any idea what the problem is?

Thanks.
Julien

<Graph.PNG>