From: Flavio Pompermaier
Date: Mon, 29 Jun 2015 11:43:14 +0200
Subject: JobManager is no longer reachable
To: user

Hi to all,

I'm restarting the discussion about a problem I already discussed on this mailing list (but that started with a different subject).
I'm running Flink 0.9.0 on CDH 5.1.3, so I compiled the sources as:

mvn clean install -Dhadoop.version=2.3.0-cdh5.1.3 -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3 -DskipTests -Pvendor-repos

The problem I'm facing is that the cluster starts successfully, but when I run my job (from the web-client) I get, after some time, this exception:

16:35:41,636 WARN akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@192.168.234.83:6123]
16:35:46,605 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disconnecting from JobManager: JobManager is no longer reachable
16:35:46,614 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
16:35:46,644 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally CHAIN GroupReduce (GroupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct at compactDataSources(MyClass.java:213)) (8/36)
16:35:46,669 INFO org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (GroupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct at compactDataSources(MyClass.java:213)) (8/36) switched to FAILED with exception.
java.lang.Exception: Disconnecting from JobManager: JobManager is no longer reachable
        at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerDisconnect(TaskManager.scala:741)
        at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:267)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
        at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
        at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
        at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
        at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:114)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
        at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
        at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
        at akka.actor.ActorCell.invoke(ActorCell.scala:486)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
        at akka.dispatch.Mailbox.run(Mailbox.scala:221)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
16:35:46,767 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code CHAIN GroupReduce (GroupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct at compactDataSources(MyClass.java:213)) (8/36) (57a0ad78726d5ba7255aa87038250c51).

However, the job runs correctly from the IDE (Eclipse). How can I understand or debug what's wrong?

Best,
Flavio
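
One way to start narrowing this down is to measure how long the TaskManager waited between the "Detected unreachable" warning and the actual disconnect in the log quoted above. The following is a minimal sketch, assuming the log line layout shown in this message ("HH:MM:SS,mmm LEVEL logger - message"); the function names parse_line and seconds_between are hypothetical helpers, not part of Flink or Akka.

```python
import re
from datetime import datetime

# Matches the log lines quoted in the message above:
# "HH:MM:SS,mmm LEVEL logger - message"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(\S+)\s+-\s+(.*)$")


def parse_line(line):
    """Return (timestamp, level, logger, message), or None for
    non-log lines such as stack frames."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%H:%M:%S,%f")
    return ts, m.group(2), m.group(3), m.group(4)


def seconds_between(lines, first_marker, second_marker):
    """Seconds from the first line whose message contains first_marker
    to the first later line whose message contains second_marker."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _, _, message = parsed
        if start is None and first_marker in message:
            start = ts
        elif start is not None and second_marker in message:
            return (ts - start).total_seconds()
    return None


# Two lines copied from the TaskManager log in this message.
sample = [
    "16:35:41,636 WARN akka.remote.RemoteWatcher - Detected unreachable: "
    "[akka.tcp://flink@192.168.234.83:6123]",
    "16:35:46,605 INFO org.apache.flink.runtime.taskmanager.TaskManager - "
    "Disconnecting from JobManager: JobManager is no longer reachable",
]
gap = seconds_between(sample, "Detected unreachable", "Disconnecting from JobManager")
print(gap)
```

Running the same helper over the JobManager log for the same time window (if available) would show whether the JobManager was still logging during the gap, which helps separate a crashed or stalled JobManager from a network problem.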