From: Stephan Ewen
To: user@flink.apache.org
Date: Mon, 29 Jun 2015 12:39:28 +0200
Subject: Re: JobManager is no longer reachable

Hi Flavio!

Can you post the JobManager's log here? It should have the message about what is going wrong...

Stephan

On Mon, Jun 29, 2015 at 11:43 AM, Flavio Pompermaier <pompermaier@okkam.it> wrote:

> Hi to all,
>
> I'm restarting the discussion about a problem I already discussed on this
> mailing list (but that started with a different subject).
> I'm running Flink 0.9.0 on CDH 5.1.3, so I compiled the sources as:
>
> mvn clean install -Dhadoop.version=2.3.0-cdh5.1.3 -Dhbase.version=0.98.1-cdh5.1.3 -Dhadoop.core.version=2.3.0-mr1-cdh5.1.3 -DskipTests -Pvendor-repos
>
> The problem I'm facing is that the cluster starts successfully, but when I
> run my job (from the web-client) I get, after some time, this exception:
>
> 16:35:41,636 WARN  akka.remote.RemoteWatcher - Detected unreachable: [akka.tcp://flink@192.168.234.83:6123]
> 16:35:46,605 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Disconnecting from JobManager: JobManager is no longer reachable
> 16:35:46,614 INFO  org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
> 16:35:46,644 INFO  org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally CHAIN GroupReduce (GroupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct at compactDataSources(MyClass.java:213)) (8/36)
> 16:35:46,669 INFO  org.apache.flink.runtime.taskmanager.Task - CHAIN GroupReduce (GroupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct at compactDataSources(MyClass.java:213)) (8/36) switched to FAILED with exception.
> java.lang.Exception: Disconnecting from JobManager: JobManager is no longer reachable
>     at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerDisconnect(TaskManager.scala:741)
>     at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$receiveWithLogMessages$1.applyOrElse(TaskManager.scala:267)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>     at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:36)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>     at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>     at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:114)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>     at akka.actor.dungeon.DeathWatch$class.receivedTerminated(DeathWatch.scala:46)
>     at akka.actor.ActorCell.receivedTerminated(ActorCell.scala:369)
>     at akka.actor.ActorCell.autoReceiveMessage(ActorCell.scala:501)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:486)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 16:35:46,767 INFO  org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code CHAIN GroupReduce (GroupReduce at compactDataSources(MyClass.java:213)) -> Combine(Distinct at compactDataSources(MyClass.java:213)) (8/36) (57a0ad78726d5ba7255aa87038250c51).
>
> The job instead runs correctly from the IDE (Eclipse). How can I
> understand/debug what's wrong?
>
> Best,
> Flavio
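
For anyone following the thread: one way to narrow down the JobManager log before posting it is to keep only the WARN/ERROR entries (and their stack traces) logged close to the moment the TaskManager reported `Detected unreachable`. The sketch below is not part of Flink. The function name `suspicious_lines` and the two-minute window are illustrative assumptions, and it assumes the JobManager log uses the same `HH:MM:SS,mmm LEVEL` line prefix as the TaskManager log quoted above.

```python
import re
from datetime import datetime, timedelta

# Matches the line prefix seen in the quoted TaskManager log, e.g.
# "16:35:41,636 WARN  akka.remote.RemoteWatcher - Detected unreachable: ..."
# Assumption: the JobManager log uses the same layout.
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2}),\d{3}\s+(\w+)\s")


def suspicious_lines(lines, around, window_seconds=120):
    """Return WARN/ERROR entries, including their stack-trace lines,
    logged within window_seconds of the wall-clock time `around` ("HH:MM:SS")."""
    center = datetime.strptime(around, "%H:%M:%S")
    window = timedelta(seconds=window_seconds)
    selected = []
    keep = False
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # A new log entry starts here: decide whether to keep it.
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            level = match.group(2)
            keep = level in ("WARN", "ERROR") and abs(stamp - center) <= window
        # Lines without a timestamp (exception messages, stack frames)
        # inherit the decision made for the entry they belong to.
        if keep:
            selected.append(line.rstrip("\n"))
    return selected
```

Passing an open log file and `around="16:35:41"` (the time of the first warning above) would return the nearby warnings and errors. The comparison ignores dates, so it only makes sense for a log that covers a single day.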