From: Till Rohrmann
Date: Fri, 19 Jun 2015 12:40:05 +0000
Subject: Re: Flink Runtime Exception
To: dev@flink.apache.org

Yes, it was an issue for the milestone release.

On Fri, Jun 19, 2015 at 2:18 PM Andra Lungu wrote:

> Yes, so I am using flink-0.9.0-milestone-1. Was it a problem for this
> version?
> I'll just fetch the latest master if this is the case.
>
> On Fri, Jun 19, 2015 at 2:12 PM, Till Rohrmann wrote:
>
> > Hi Andra,
> >
> > the problem seems to be that the deployment of some tasks takes longer
> > than 100s. From the stack trace it looks as if you're not using the
> > latest master.
> >
> > We had problems with previous versions where the deployment call waited
> > for the TM to completely download the user code jars. For large setups
> > the BlobServer became a bottleneck and some of the deployment calls
> > timed out. We updated the deployment logic so that the TM sends an
> > immediate ACK back to the JM when it receives a new task.
> >
> > Could you verify which version of Flink you're running and, in case
> > it's not the latest master, could you please try to run your example
> > with the latest code?
> >
> > Cheers,
> > Till
> >
> > On Fri, Jun 19, 2015 at 1:42 PM Andra Lungu wrote:
> >
> > > Hi everyone,
> > >
> > > I ran a job this morning on 30 wally nodes. DOP 224. Worked like a
> > > charm.
> > >
> > > Then, I ran a similar job, on the exact same configuration, on the same
> > > input data set. The only difference between them is that the second job
> > > computes the degrees per vertex and, for vertices with degree higher
> > > than a user-defined threshold, it does a bit of magic (roughly a bunch
> > > of coGroups).
> > > The problem is that, even before the extra functions get called,
> > > I get the following type of exception:
> > >
> > > 06/19/2015 12:06:43 CHAIN FlatMap (FlatMap at fromDataSet(Graph.java:171)) -> Combine(Distinct at fromDataSet(Graph.java:171))(222/224) switched to FAILED
> > > java.lang.IllegalStateException: Update task on instance 29073fb0b0957198a2b67569b042d56b @ wally004 - 8 slots - URL: akka.tcp://flink@130.149.249.14:44528/user/taskmanager failed due to:
> > >     at org.apache.flink.runtime.executiongraph.Execution$5.onFailure(Execution.java:860)
> > >     at akka.dispatch.OnFailure.internal(Future.scala:228)
> > >     at akka.dispatch.OnFailure.internal(Future.scala:227)
> > >     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
> > >     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
> > >     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > >     at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:25)
> > >     at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
> > >     at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
> > >     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
> > >     at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
> > >     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> > >     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> > >     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> > >     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> > > Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@130.149.249.14:44528/user/taskmanager#82700874]] after [100000 ms]
> > >     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
> > >     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
> > >     at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
> > >     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
> > >     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
> > >     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
> > >     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
> > >     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
> > >     at java.lang.Thread.run(Thread.java:722)
> > >
> > > At first I thought, okay maybe wally004 is down; then I ssh'd into it.
> > > Works fine.
> > >
> > > The full output can be found here:
> > > https://gist.github.com/andralungu/d222b75cb33aea57955d
> > >
> > > Does anyone have any idea about what may have triggered this? :(
> > >
> > > Thanks!
> > > Andra
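
The deployment change described in this thread (the TM acknowledging a new task right away instead of replying only after the user code jars are downloaded) can be illustrated with a rough, self-contained sketch. This is not Flink's actual code: the class `DeployAckSketch` and the methods `downloadJars`, `deployBlocking`, `deployWithImmediateAck` and `ask` are hypothetical names made up for illustration, and the millisecond values are scaled-down stand-ins for the 100 s ask timeout seen in the stack trace.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeployAckSketch {

    // Hypothetical stand-in for a slow jar download (e.g. from an overloaded BlobServer).
    static void downloadJars(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Assumed "old" behaviour: the reply is sent only after the download has finished.
    static CompletableFuture<String> deployBlocking(long downloadMillis) {
        return CompletableFuture.supplyAsync(() -> {
            downloadJars(downloadMillis);
            return "ACK";
        });
    }

    // Assumed "new" behaviour: reply immediately, keep downloading in the background.
    static CompletableFuture<String> deployWithImmediateAck(long downloadMillis) {
        CompletableFuture.runAsync(() -> downloadJars(downloadMillis));
        return CompletableFuture.completedFuture("ACK");
    }

    // Caller side: wait for the reply, but give up after the ask timeout.
    static String ask(CompletableFuture<String> reply, long timeoutMillis) {
        try {
            return reply.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return "TIMEOUT";
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The download (1000 ms) takes longer than the ask timeout (100 ms).
        System.out.println(ask(deployBlocking(1000), 100));
        System.out.println(ask(deployWithImmediateAck(1000), 100));
    }
}
```

With the blocking variant the caller gives up before the reply arrives, which mirrors the AskTimeoutException above; with the immediate ACK the call returns well within the timeout even though the download is still running.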