Subject: Re: Flink hanging between job executions / All Pairs Shortest Paths
From: Stephan Ewen
To: user@flink.apache.org
Date: Wed, 13 May 2015 15:01:20 +0200

BTW, you should be able to see that when you print the execution plan
instead of executing the program.

I am not sure where the hang comes from. Is it an actual hang, or does it
just take long? If it is a hang, does it occur in the optimizer or in the
distributed runtime?

On Wed, May 13, 2015 at 3:00 PM, Stephan Ewen wrote:

> I think this is a good case where loops in the program can cause issues
> right now.
>
> The next graph always depends on the previous graph. This is a bit like a
> recursive definition. In the 10th iteration, in order to execute the
> print() command, you need to compute the 9th graph, which requires the 8th
> graph, ...
> It is like the inefficient recursive way of computing the Fibonacci
> numbers.
>
> The only way to get around that is to strictly cache the intermediate
> data set. Flink will support that internally in a few weeks (let's see
> if it is in time for 0.9; it may not be). Until then, you need to
> explicitly persist the graph after each loop iteration.
>
>
> On Wed, May 13, 2015 at 2:45 PM, Mihail Vieru <
> vieru@informatik.hu-berlin.de> wrote:
>
>> Hi all,
>>
>> I've got a problem when running the attached APSPNaiveJob on a graph with
>> just 1000 vertices (local execution; 0.9-SNAPSHOT).
>> It solves the AllPairsShortestPaths problem the naive way, executing
>> SingleSourceShortestPaths n times and storing the computed distances in a
>> distance vector for each vertex.
>>
>> The problem is that Flink almost comes to a standstill when it reaches
>> the 20th iteration, i.e. computing SSSP with srcVertexId = 20. The net
>> runtime is becoming increasingly larger than the total runtime with each
>> iteration, with Flink hanging between executions.
>>
>> I didn't have this problem when each vertex contained just one distance
>> value instead of a distance vector. It ran SSSP 1000 times without any
>> issues.
>>
>> The loop:
>>
>>     while (srcVertexId < numOfVertices) {
>>         System.out.println("!!! Executing SSSP for srcVertexId = " + srcVertexId);
>>
>>         graph = graph.run(new APSP<Long>(srcVertexId, maxIterations));
>>
>>         graph.getVertices().print();
>>
>>         intermediateResult = env.execute("APSPNaive");
>>         jobRuntime += intermediateResult.getNetRuntime();
>>
>>         srcVertexId++;
>>     }
>>
>> And the program arguments (the first being srcVertexId and the second
>> numOfVertices, as used in the loop):
>>
>>     0 30
>>     /home/vieru/dev/flink-experiments/data/social_network.verticeslistwweights-1k2
>>     /home/vieru/dev/flink-experiments/data/social_network.edgelist-1k
>>     /home/vieru/dev/flink-experiments/sssp-output-x-higgstwitter 10
>>
>> Do you know what could cause this problem?
>>
>> I would greatly appreciate any help.
>>
>> Best,
>> Mihail
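The recursive dependency described in the thread can be illustrated outside Flink. The following is a minimal sketch in plain Java, not Flink code: a `Supplier` stands in for a lazily defined data set, `get()` stands in for `print()` plus `env.execute()`, and the names `LazyPlanDemo`, `nextStep`, `stepsWithoutMaterializing` and `stepsWithMaterializing` are hypothetical. It assumes a simple linear chain in which each step consumes the previous result exactly once, so the total work grows quadratically; a step that consumed the previous result more than once would grow faster.

```java
import java.util.function.Supplier;

final class LazyPlanDemo {
    // Counts how often a single "step" (one SSSP run in the thread's loop) is evaluated.
    static long stepEvaluations = 0;

    // Defines the next result lazily on top of the previous one.
    static Supplier<Long> nextStep(Supplier<Long> previous) {
        return () -> {
            long input = previous.get(); // pulls in the whole lineage behind `previous`
            stepEvaluations++;
            return input + 1;
        };
    }

    // Mirrors the loop in the thread: extend the plan, then execute it, n times.
    static long stepsWithoutMaterializing(int n) {
        stepEvaluations = 0;
        Supplier<Long> graph = () -> 0L;
        for (int i = 0; i < n; i++) {
            graph = nextStep(graph);
            graph.get(); // re-runs every earlier step as well
        }
        return stepEvaluations;
    }

    // Same loop, but the result of each iteration becomes a fresh source with no history.
    static long stepsWithMaterializing(int n) {
        stepEvaluations = 0;
        Supplier<Long> graph = () -> 0L;
        for (int i = 0; i < n; i++) {
            final long materialized = nextStep(graph).get();
            graph = () -> materialized;
        }
        return stepEvaluations;
    }

    public static void main(String[] args) {
        System.out.println(stepsWithoutMaterializing(30)); // 1 + 2 + ... + 30 = 465
        System.out.println(stepsWithMaterializing(30));    // 30
    }
}
```

With 30 iterations, as in the program arguments quoted in the thread, the uncached variant evaluates 465 steps while the materialized one evaluates 30. The sketch only counts step evaluations; it does not model optimizer time, which the thread also raises as a possible source of the delay.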
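The workaround suggested in the thread, explicitly persisting the graph after each loop iteration, might look like the following in outline. This is a sketch in plain Java under stated assumptions, not Flink code: vertices are modelled as a list of distance vectors, `runStep` is a hypothetical stand-in for one SSSP run that writes a placeholder value rather than a real shortest path, and the intermediate result is written to a temporary file and read back so that the next iteration starts from a plain file source with no history.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class PersistBetweenIterations {
    // Hypothetical stand-in for one SSSP run: fills slot `src` of every distance vector.
    static List<double[]> runStep(List<double[]> vertices, int src) {
        List<double[]> out = new ArrayList<>();
        for (int v = 0; v < vertices.size(); v++) {
            double[] d = vertices.get(v).clone();
            d[src] = Math.abs(v - src); // placeholder value, not a real shortest path
            out.add(d);
        }
        return out;
    }

    // Writes one comma-separated distance vector per line.
    static void persist(List<double[]> vertices, Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (double[] d : vertices) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < d.length; i++) {
                if (i > 0) sb.append(',');
                sb.append(d[i]);
            }
            lines.add(sb.toString());
        }
        Files.write(file, lines);
    }

    // Reads the vectors back; the result carries no reference to earlier iterations.
    static List<double[]> load(Path file) throws IOException {
        List<double[]> vertices = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            String[] parts = line.split(",");
            double[] d = new double[parts.length];
            for (int i = 0; i < parts.length; i++) d[i] = Double.parseDouble(parts[i]);
            vertices.add(d);
        }
        return vertices;
    }

    // Runs n steps over n vertices, persisting and reloading after each one.
    static List<double[]> runAll(int n) {
        try {
            Path file = Files.createTempFile("apsp-intermediate", ".csv");
            List<double[]> vertices = new ArrayList<>();
            for (int v = 0; v < n; v++) vertices.add(new double[n]);
            for (int src = 0; src < n; src++) {
                persist(runStep(vertices, src), file);
                vertices = load(file); // next iteration starts from the file
            }
            Files.delete(file);
            return vertices;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(4).get(3)[0]);
    }
}
```

In an actual Flink job the same idea would mean writing the vertex data set to a file sink, executing, and building the next iteration's graph from a file source; which sink and source formats suit a vertex value holding a distance vector is left open here.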