Mailing-List: contact user-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@flink.apache.org
MIME-Version: 1.0
Sender: ewenstephan@gmail.com
In-Reply-To: 
 <CANC1h_sZQsYfmFuk_MVTgpk0_ZxegC-zjQLojAmzbhEnbJDcVA@mail.gmail.com>
References: <5553475E.3070406@informatik.hu-berlin.de>
	<CANC1h_sZQsYfmFuk_MVTgpk0_ZxegC-zjQLojAmzbhEnbJDcVA@mail.gmail.com>
Date: Wed, 13 May 2015 15:01:20 +0200
Message-ID: 
 <CANC1h_tcFYWCQvHupuK7a7wHnWQ2GS87gKtE-t7nUG3en0HSoQ@mail.gmail.com>
Subject: Re: Flink hanging between job executions / All Pairs Shortest Paths
From: Stephan Ewen <sewen@apache.org>
To: user@flink.apache.org
Content-Type: multipart/alternative; boundary=089e01538ab41d124b0515f635d4

--089e01538ab41d124b0515f635d4
Content-Type: text/plain; charset=UTF-8

BTW, you should be able to see this if you print the execution plan instead
of executing the program.

I am not sure where the hang comes from. Is it an actual hang, or does it
just take a long time? If it is a hang, does it occur in the optimizer or in
the distributed runtime?
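The recursive dependency described in the quoted message can be reproduced with a toy model in plain Java (not Flink code). `PlanGrowth`, `PlanNode`, `depth()` and `explain()` are hypothetical names, and each SSSP run is simplified to a single plan node. Because each loop iteration defines the new graph in terms of the previous one, the plan submitted in iteration k contains all earlier runs. In a real Flink DataSet program, the analogous check is printing `env.getExecutionPlan()` (which returns the plan as a JSON string) instead of calling `env.execute()`.

```java
/** Toy model of a lazily built dataflow plan (plain Java, not Flink code). */
public class PlanGrowth {

    /** Hypothetical plan operator that points at the operator it reads from. */
    static final class PlanNode {
        final String name;
        final PlanNode input; // null for the data source

        PlanNode(String name, PlanNode input) {
            this.name = name;
            this.input = input;
        }

        /** Number of operators that must run to produce this node's result. */
        int depth() {
            return input == null ? 1 : 1 + input.depth();
        }

        /** Renders the lineage from this node back to the source. */
        String explain() {
            return input == null ? name : name + " <- " + input.explain();
        }
    }

    public static void main(String[] args) {
        PlanNode graph = new PlanNode("source", null);
        for (int src = 0; src < 3; src++) {
            // Mirrors graph = graph.run(new APSP<Long>(srcVertexId, maxIterations)):
            // the new graph is defined in terms of the previous one, not replaced by it.
            graph = new PlanNode("APSP(src=" + src + ")", graph);
            System.out.println("iteration " + src + ", operators=" + graph.depth()
                    + ": " + graph.explain());
        }
    }
}
```

The last line printed is `iteration 2, operators=4: APSP(src=2) <- APSP(src=1) <- APSP(src=0) <- source`, i.e. the plan keeps growing by one SSSP run per iteration.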


On Wed, May 13, 2015 at 3:00 PM, Stephan Ewen <sewen@apache.org> wrote:

> I think this is a good case where loops in the program can cause issues
> right now.
>
> The next graph always depends on the previous graph. This is a bit like a
> recursive definition. In the 10th iteration, in order to execute the
> print() command, you need to compute the 9th graph, which requires the 8th
> graph, ...
> It is like the inefficient recursive way of computing the Fibonacci
> numbers.
>
> The only way to get around that is to strictly cache the intermediate
> data set. Flink will support that internally in a few weeks (let's see if
> it is in time for 0.9; it may not be). Until then, you need to explicitly
> persist the graph after each loop iteration (write it out and read it back
> in as the input of the next iteration).
>
>
> On Wed, May 13, 2015 at 2:45 PM, Mihail Vieru <
> vieru@informatik.hu-berlin.de> wrote:
>
>>  Hi all,
>>
>> I've got a problem when running the attached APSPNaiveJob on a graph with
>> just 1000 vertices (local execution; 0.9-SNAPSHOT).
>> It solves the AllPairsShortestPaths problem the naive way, by executing
>> SingleSourceShortestPaths n times and storing the computed distances in a
>> distance vector for each vertex.
>>
>> The problem is that Flink almost comes to a standstill when it reaches the
>> 20th iteration, i.e. computing SSSP with srcVertexId = 20. The difference
>> between the net runtime and the total runtime grows with each iteration,
>> with Flink hanging between executions.
>>
>> I didn't have this problem when each vertex held just a single distance
>> value instead of a distance vector. That version ran SSSP 1000 times
>> without any issues.
>>
>> The loop:
>>
>>         while (srcVertexId < numOfVertices) {
>>             System.out.println("!!! Executing SSSP for srcVertexId = " + srcVertexId);
>>
>>             graph = graph.run(new APSP<Long>(srcVertexId, maxIterations));
>>
>>             graph.getVertices().print();
>>
>>             intermediateResult = env.execute("APSPNaive");
>>             jobRuntime += intermediateResult.getNetRuntime();
>>
>>             srcVertexId++;
>>         }
>>
>> And the program arguments (first being *srcVertexId* and second
>> *numOfVertices* used in the loop):
>>
>> 0 30
>> /home/vieru/dev/flink-experiments/data/social_network.verticeslistwweights-1k2
>> /home/vieru/dev/flink-experiments/data/social_network.edgelist-1k
>> /home/vieru/dev/flink-experiments/sssp-output-x-higgstwitter 10
>>
>> Do you know what could cause this problem?
>>
>> I would greatly appreciate any help.
>>
>> Best,
>> Mihail
>>
>
>
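The recomputation Stephan describes, and why persisting the graph after each iteration helps, can be made concrete with a toy cost model in plain Java. Assumptions: `RecomputeCost`, `Lazy`, `step`, `naive` and `persisted` are hypothetical names; `step` stands in for one SSSP run, and evaluating a `Lazy` value stands in for `env.execute()` re-running the plan from its sources. The model counts only step evaluations and ignores optimizer time, which may also grow with plan size.

```java
/** Toy comparison of recomputing a lazy chain vs. persisting each step (not Flink code). */
public class RecomputeCost {

    /** Counts how often the stand-in SSSP step is actually applied. */
    static long stepCalls = 0;

    /** A lazily defined dataset: evaluating it re-evaluates its whole lineage. */
    interface Lazy {
        long evaluate();
    }

    /** Hypothetical stand-in for one SSSP run; here it just adds 1 to its input. */
    static long step(long input) {
        stepCalls++;
        return input + 1;
    }

    /** Naive loop: each iteration's "execute" re-evaluates every earlier step. */
    static long naive(int iterations) {
        stepCalls = 0;
        Lazy graph = () -> 0L; // the data source
        for (int i = 0; i < iterations; i++) {
            Lazy previous = graph;
            graph = () -> step(previous.evaluate()); // defined on top of the previous graph
            graph.evaluate();                        // like print() + env.execute()
        }
        return stepCalls;
    }

    /** Persisting loop: each result is materialized once and re-read as a new source. */
    static long persisted(int iterations) {
        stepCalls = 0;
        Lazy graph = () -> 0L;
        for (int i = 0; i < iterations; i++) {
            long materialized = step(graph.evaluate()); // execute once, "write it out"
            graph = () -> materialized;                 // "read it back" as a fresh source
        }
        return stepCalls;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 20, 30}) {
            System.out.println(n + " iterations: naive=" + naive(n) + ", persisted=" + persisted(n));
        }
    }
}
```

With the loop bound from the program arguments above (`0 30`), the naive loop performs 30 * 31 / 2 = 465 step evaluations in this model, versus 30 when each intermediate graph is persisted: quadratic versus linear total work.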

--089e01538ab41d124b0515f635d4--