Return-Path: X-Original-To: apmail-flink-user-archive@minotaur.apache.org Delivered-To: apmail-flink-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 5D16718D1F for ; Tue, 21 Jul 2015 14:02:45 +0000 (UTC) Received: (qmail 11428 invoked by uid 500); 21 Jul 2015 14:02:37 -0000 Delivered-To: apmail-flink-user-archive@flink.apache.org Received: (qmail 11354 invoked by uid 500); 21 Jul 2015 14:02:37 -0000 Mailing-List: contact user-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@flink.apache.org Delivered-To: mailing list user@flink.apache.org Received: (qmail 11343 invoked by uid 99); 21 Jul 2015 14:02:37 -0000 Received: from Unknown (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 21 Jul 2015 14:02:37 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id B6DAB1A7591 for ; Tue, 21 Jul 2015 14:02:36 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 4.002 X-Spam-Level: **** X-Spam-Status: No, score=4.002 tagged_above=-999 required=6.31 tests=[HTML_MESSAGE=3, KAM_LAZY_DOMAIN_SECURITY=1, URIBL_BLOCKED=0.001, WEIRD_PORT=0.001] autolearn=disabled Received: from mx1-us-east.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id HAIgRukkudX4 for ; Tue, 21 Jul 2015 14:02:26 +0000 (UTC) Received: from mail-yk0-f176.google.com (mail-yk0-f176.google.com [209.85.160.176]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id E1B95428E7 for ; Tue, 21 Jul 2015 14:02:25 +0000 (UTC) Received: by ykay190 with SMTP id y190so165983653yka.3 for ; Tue, 21 Jul 2015 07:02:25 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-type; bh=dAWNCg8u6+BIQf5Iheb86ujd071ju4uSH/xIQ2k6BXY=; b=RevePl0ROk6FhEpJwblISuR9XS97n8aeNtCuz+Ep9VI+mKy0b6TTO6bH86tJhzTeoV EEq6isSA/2dOkRSQa5FLCTQlOsr1t/4Jh1Md/vs8CbQTDxGnczTnag7zFtl5ZxcxF3J9 iG/9VPRiq0mx4/HBB2SE9xdQMg2kO5+u1fI7y/nvtcDhC24D7ZWsjehr6VvOP3+RWL/1 WiiSUSIBsYbVwmu8h/urxFfn69cFy17zPxv3t3DyMwcfVFFAcPAiU0433JAdX76iTo9V /X5HBJ3YakZNW0j/ZkNKD+ou615lrH6MB6PZxe42C57SlbLXcG4E3FtWiWS3xunnYqp6 LwqQ== X-Gm-Message-State: ALoCoQmSyLrExfCz4bQQLqhF3csDM3/5XclEHXopVeI6As2akCMvYrAGsptwNTPcodf7M5N0RVvj X-Received: by 10.13.200.67 with SMTP id k64mr34562141ywd.172.1437487345259; Tue, 21 Jul 2015 07:02:25 -0700 (PDT) MIME-Version: 1.0 Received: by 10.129.70.10 with HTTP; Tue, 21 Jul 2015 07:02:05 -0700 (PDT) X-Originating-IP: [213.203.177.29] In-Reply-To: References: From: Flavio Pompermaier Date: Tue, 21 Jul 2015 16:02:05 +0200 Message-ID: Subject: Re: JobManager is no longer reachable To: user Content-Type: multipart/alternative; boundary=001a114e4e8e95a1cc051b631a8b --001a114e4e8e95a1cc051b631a8b Content-Type: text/plain; charset=UTF-8 I think that the problem is that the error was caused by a class logging through java.utils.logging and in the have those logs working I had to put *SLF4JBridgeHandler.install();* at the beginning of the main(). Probably this should be documented..actually I don't know why this worked :) On Tue, Jul 21, 2015 at 3:29 PM, Stephan Ewen wrote: > Exceptions are swallowed upon canceling (because canceling has usually > followup exceptions). > > Root error cause exceptions should never be swallowed. > > Do you have a specific place in mind where that happens? > > On Mon, Jun 29, 2015 at 4:49 PM, Flavio Pompermaier > wrote: > >> I think that actually there's an Exception thrown within the code that I >> suspect it's not reported anywhere..could it be? >> >> On Mon, Jun 29, 2015 at 3:28 PM, Flavio Pompermaier > > wrote: >> >>> Which file and which JVM options do I have to modify to try options 1 >>> and 3..? >>> >>> 1. Don't fill the JVMs up to the limit with objects. Give more >>> memory to the JVM, or give less memory to Flink managed memory >>> 2. Use more JVMs, i.e., a higher parallelism >>> 3. Use a concurrent garbage collector, like G1 >>> >>> Actually, when I run the code from Eclipse I see an exception do to an >>> error in the data (because I try to read a URI that contains illegal >>> characters) but I don't think the program reach that point, I don't see >>> anywhere an exception and the error occur later on in the code.. >>> >>> However, all of your options seems related to a scalability problem, >>> where I should add more resources to complete the work...while it works >>> locally in the IDE where I have less resources (except the gc that I use >>> default settings while I don't know if the cluster has some default >>> ones)..isn't it strange? >>> >>> On Mon, Jun 29, 2015 at 2:29 PM, Stephan Ewen wrote: >>> >>>> Hi Flavio! >>>> >>>> I had a look at the logs. There seems nothing suspicious - at some >>>> point, the TaskManager and JobManager declare each other unreachable. >>>> >>>> A pretty common cause for that is that the JVMs stall for a long time >>>> due to garbage collection. The JobManager cannot see the difference between >>>> a JVM that is irresponsive (due to garbage collection) and a JVM that is >>>> dead. >>>> >>>> Here is what you can do to prevent long garbage collection stalls: >>>> >>>> - Don't fill the JVMs up to the limit with objects. Give more memory >>>> to the JVM, or give less memory to Flink managed memory. >>>> - Use more JVMs, i.e., a higher parallelism. >>>> - Use a concurrent garbage collector, like G1. >>>> >>>> >>>> Greetings, >>>> Stephan >>>> >>>> >>>> On Mon, Jun 29, 2015 at 12:39 PM, Stephan Ewen >>>> wrote: >>>> >>>>> Hi Flavio! >>>>> >>>>> Can you post the JobManager's log here? It should have the message >>>>> about what is going wrong... >>>>> >>>>> Stephan >>>>> >>>>> >>>>> On Mon, Jun 29, 2015 at 11:43 AM, Flavio Pompermaier < >>>>> pompermaier@okkam.it> wrote: >>>>> >>>>>> Hi to all, >>>>>> >>>>>> I'm restarting the discussion about a problem I alredy dicussed on >>>>>> this mailing list (but that started with a different subject). >>>>>> I'm running Flink 0.9.0 on CDH 5.1.3 so I compiled the sources as: >>>>>> >>>>>> mvn clean install -Dhadoop.version=2.3.0-cdh5.1.3 >>>>>> -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3 >>>>>> -DskipTests -Pvendor-repos >>>>>> >>>>>> The problem I'm facing is that the cluster start successfully but >>>>>> when I run my job (from the web-client) I get, after some time, this >>>>>> exception: >>>>>> >>>>>> 16:35:41,636 WARN akka.remote.RemoteWatcher >>>>>> - Detected unreachable: [akka.tcp:// >>>>>> flink@192.168.234.83:6123] >>>>>> 16:35:46,605 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> - Disconnecting from JobManager: JobManager is no longer reachable >>>>>> 16:35:46,614 INFO org.apache.flink.runtime.taskmanager.TaskManager >>>>>> - Cancelling all computations and discarding all cached data. >>>>>> 16:35:46,644 INFO org.apache.flink.runtime.taskmanager.Task >>>>>> - Attempting to fail task externally CHAIN GroupReduce (GroupReduce >>>>>> at compactDataSources(MyClass.java:213)) -> Combine(Distinct at >>>>>> compactDataSources(MyClass.java:213)) (8/36) >>>>>> 16:35:46,669 INFO org.apache.flink.runtime.taskmanager.Task >>>>>> - CHAIN GroupReduce (GroupReduce at >>>>>> compactDataSources(MyClass.java:213)) -> Combine(Distinct at >>>>>> compactDataSources(MyClass.java:213)) (8/36) switched to FAILED with >>>>>> exception. >>>>>> java.lang.Exception: Disconnecting from JobManager: JobManager is no >>>>>> longer reachable >>>>>> at org.apache.flink.runtime.taskmanager.TaskManager.org >>>>>> $apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerDisconnect(TaskManager.scala:741) >>>>>> at >>>>>> org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:267) >>>>>> at >>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>>>>> at >>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>>>>> at >>>>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>>>>> at >>>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36) >>>>>> at >>>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29) >>>>>> at >>>>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>>>>> at >>>>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29) >>>>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>>>>> at >>>>>> org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:114) >>>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>>>>> at >>>>>> akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46) >>>>>> at >>>>>> akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369) >>>>>> at >>>>>> akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501) >>>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:486) >>>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>>>> at >>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>>>> 16:35:46,767 INFO org.apache.flink.runtime.taskmanager.Task >>>>>> - Triggering cancellation of task code CHAIN GroupReduce >>>>>> (GroupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct >>>>>> at compactDataSources(MyClass.java:213)) (8/36) >>>>>> (57a0ad78726d5ba7255aa87038250c51). >>>>>> >>>>>> The job instead runs correctly from the IDE (Eclipse). How can I >>>>>> understand/debug what's wrong? >>>>>> >>>>>> Best, >>>>>> Flavio >>>>>> >>>>>> >>>>> >>>> >>> >>> > --001a114e4e8e95a1cc051b631a8b Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
I think that the problem is that the error was caused by a= class logging through java.utils.logging and in the have those =C2=A0logs = working I had to put=C2=A0SLF4JBridgeHandler.install(); at the begin= ning of the main().
Probably this should be documented..actually I don&= #39;t know why this worked :)

On Tue, Jul 21, 2015 at 3:29 PM, Stephan Ewen <sewen@apach= e.org> wrote:
Exceptions= are swallowed upon canceling (because canceling has usually followup excep= tions).

Root error cause exceptions should never be swal= lowed.

Do you have a specific place in mind where = that happens?

On Mon, Jun 29, 2015 at 4:49 PM, Flavio Pompermaier <= pompermaier@okkam.it> wrote:
I think that actually there's an Exception thrown within the code = that I suspect it's not reported anywhere..could it be?

On Mon, Jun 29, 2015 at= 3:28 PM, Flavio Pompermaier <pompermaier@okkam.it> wrote= :
Which file and which JVM options do = I have to modify to try options 1 and 3..?= =C2=A0
  1. Don't fi= ll the JVMs up to the limit with objects. Give more memory to the JVM, or g= ive less memory to Flink managed memory
  2. Use more JVMs, i.e., a higher parallelism
  3. Use a concurrent garbage = collector, like G1
Actually, when I run the cod= e from Eclipse I see an exception do to an error in the data (because I try= to read a URI that contains illegal characters) but I don't think the = program reach that point, I don't see anywhere an exception and the err= or occur later on in the code..

However, all= of your options seems related to a scalability problem, where I should add= more resources to complete the work...while it works locally in the IDE wh= ere I have less resources (except the gc that I use default settings while = I don't know if the cluster has some default ones)..isn't it strang= e?

On Mon, Jun 29, 2015 at 2:29 PM, Stephan Ewen <sewen@apache.org> wrote:
Hi Flavio!

<= div>I had a look at the logs. There seems nothing suspicious - at some poin= t, the TaskManager and JobManager declare each other unreachable.

A pretty common cause for that is that the JVMs stall for a= long time due to garbage collection. The JobManager cannot see the differe= nce between a JVM that is irresponsive (due to garbage collection) and a JV= M that is dead.

Here is what you can do to prevent= long garbage collection stalls:

=C2=A0- Don't= fill the JVMs up to the limit with objects. Give more memory to the JVM, o= r give less memory to Flink managed memory.
=C2=A0- Use more JVMs= , i.e., a higher parallelism.
=C2=A0- Use a concurrent garbage co= llector, like G1.


Greetings,
<= div>Stephan


On Mon, Jun 29, 2015 at 12:39 PM, Stephan Ewen <= span dir=3D"ltr"><= sewen@apache.org> wrote:
Hi Flavio!

Can you post the JobManager= 's log here? It should have the message about what is going wrong...

Stephan

=

On Mon, Jun 29, 2015 at 11:43 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:
Hi to all,

=
I'm restarting the discussion about a problem I alredy dicussed on= this mailing list (but that started with a different subject).
I= 'm running Flink 0.9.0 on CDH 5.1.3 so I compiled the sources as:
=

mvn clean =C2=A0install -Dhadoop.version=3D2.3.0-cdh5.1= .3 -Dhbase.version=3D0.98.1-cdh5.1.3 -Dhadoop.core.version=3D2.3.0-mr1-cdh5= .1.3 -DskipTests -Pvendor-repos

The problem I&= #39;m facing is that the cluster start successfully but when I run my job (= from the web-client) I get, after some time, this exception:

=
16:35:41,636 WARN =C2=A0akka.remote.RemoteWatcher =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 - Detected unreachable: [akka= .tcp://flink= @192.168.234.83:6123]
16:35:46,605 INFO =C2=A0org.apache.flin= k.runtime.taskmanager.TaskManager =C2=A0 - Disconnecting from JobManager: J= obManager is no longer reachable
16:35:46,614 INFO =C2=A0org.apac= he.flink.runtime.taskmanager.TaskManager =C2=A0 - Cancelling all computatio= ns and discarding all cached data.
16:35:46,644 INFO =C2=A0org.ap= ache.flink.runtime.taskmanager.Task =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 - Attempting to fail task externally CHAIN GroupReduce (G= roupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct = at compactDataSources(MyClass.java:213)) (8/36)
16:35:46,669 INFO= =C2=A0org.apache.flink.runtime.taskmanager.Task =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 - CHAIN GroupReduce (GroupReduce at compact= DataSources(MyClass.java:213)) -> Combine(Distinct at compactDataSources= (MyClass.java:213)) (8/36) switched to FAILED with exception.
jav= a.lang.Exception: Disconnecting from JobManager: JobManager is no longer re= achable
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at org.apache.= flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmana= ger$TaskManager$$handleJobManagerDisconnect(TaskManager.scala:741)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at org.apache.flink.runtime.taskmanager.TaskM= anager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:267)=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at scala.runtime.AbstractPartialFunc= tion$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 at scala.runtime.AbstractPartialFunction$mcVL$sp.a= pply(AbstractPartialFunction.scala:33)
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialF= unction.scala:25)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at org.apache.flink= .runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at org.apache.flink.runtime.ActorLogMessages$= $anon$1.apply(ActorLogMessages.scala:29)
=C2=A0 =C2=A0 =C2=A0 =C2= =A0 at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at org.apache.flink.runtime.ActorLogMe= ssages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)<= /div>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at org.apache.flink.runtime.taskmanag= er.TaskManager.aroundReceive(TaskManager.scala:114)
=C2=A0 =C2=A0= =C2=A0 =C2=A0 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)<= /div>
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at akka.actor.dungeon.DeathWatch$clas= s.receivedTerminated(DeathWatch.scala:46)
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at akka.actor.ActorCell.autoReceiveMessag= e(ActorCell.scala:501)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at akka.actor.= ActorCell.invoke(ActorCell.scala:486)
=C2=A0 =C2=A0 =C2=A0 =C2=A0= at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at akka.dispatch.Mailbox.exec(Mailbox.s= cala:231)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at scala.concurrent.forkjoi= n.ForkJoinTask.doExec(ForkJoinTask.java:260)
=C2=A0 =C2=A0 =C2=A0= =C2=A0 at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(= ForkJoinPool.java:1253)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at scala.conc= urrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at scala.concurrent.forkjoin.ForkJoinPool= .runWorker(ForkJoinPool.java:1979)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 at= scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.ja= va:107)
16:35:46,767 INFO =C2=A0org.apache.flink.runtime.taskmana= ger.Task =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 - Triggering cancellation of task code CHAIN GroupReduce (GroupReduce a= t compactDataSources(MyClass.java:213)) -> Combine(Distinct at compactDa= taSources(MyClass.java:213)) (8/36) (57a0ad78726d5ba7255aa87038250c51).

The job instead runs correctly from the IDE (Ec= lipse). How can I understand/debug what's wrong?

Best,
Flavio





=



<= /p>

--001a114e4e8e95a1cc051b631a8b--