Date: Tue, 30 Sep 2014 10:45:54 +0200
Subject: Re: Spargel: Memory runs out at setNewVertexValue()
From: Stephan Ewen <ewenstephan@gmail.com>
To: user@flink.incubator.apache.org

Hey!

Thanks for the observation. Here is what I can see:

The distribution of hash values is very skewed: one partition is a single buffer in size, while another spans 155 buffers. Are your objects very different in size, or is the hash function flawed? A more even distribution may help a lot here.
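
If you want to double-check the skew offline, a quick histogram over the vertex keys shows it. This is just a sketch, assuming your vertex keys are Longs (adjust the key type to your setup); running it with numPartitions = 32 should mirror the numbers in the error message:

import java.util.HashMap;
import java.util.Map;

public class KeySkewCheck {

    // Buckets vertex keys the way a hash table with 'numPartitions'
    // partitions would: non-negative hashCode modulo partition count.
    // A heavily skewed histogram here points to a key/hash problem
    // rather than to the solution set itself.
    public static Map<Integer, Integer> histogram(Iterable<Long> vertexKeys, int numPartitions) {
        Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
        for (Long key : vertexKeys) {
            int bucket = (key.hashCode() & 0x7FFFFFFF) % numPartitions;
            Integer seen = counts.get(bucket);
            counts.put(bucket, seen == null ? 1 : seen + 1);
        }
        return counts;
    }
}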

The solution set of the delta iterations is the Achilles' heel of the system right now. We are actively working to make its memory more adaptive and give it more if needed. Expect a big fix in a few weeks.

In the meantime, let me try to do a patch for an unofficial non-managed-memory solution set. That should be able to grow into the heap and grab more memory if needed.
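
To make that concrete: at the plain delta-iteration level, the switch would probably be a flag on the iteration operator, roughly along these lines. This is only a sketch of the idea; the setSolutionSetUnManaged(...) call is the hypothetical knob described above, not an existing API, and the step function is elided:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;

public class UnmanagedSolutionSetSketch {

    public static DataSet<Tuple2<Long, Double>> run(
            DataSet<Tuple2<Long, Double>> initialState, int maxIterations) {

        // Delta iteration keyed on field 0 (the vertex id).
        DeltaIteration<Tuple2<Long, Double>, Tuple2<Long, Double>> iteration =
                initialState.iterateDelta(initialState, maxIterations, 0);

        // Hypothetical switch: keep the solution set in a plain on-heap
        // hash table instead of Flink's managed memory, so it can grow
        // with the JVM heap.
        iteration.setSolutionSetUnManaged(true);

        // ... the real step function (joins against the solution set,
        // producing the delta and the next workset) would go here ...
        DataSet<Tuple2<Long, Double>> delta = iteration.getWorkset();

        return iteration.closeWith(delta, delta);
    }
}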

Stephan

On 29.09.2014 at 16:11, "Attila Bernáth" <bernath.athos@gmail.com> wrote:
> Dear Developers,

> We are experimenting with a PageRank variant in which the nodes of the
> graph are grouped into supernodes. The nodes send messages to
> supernodes instead of to individual nodes; we thus expect to reduce the
> number of messages and speed up the algorithm.
> We implemented this algorithm with the Spargel API, using the
> vertex-centric iterations. The VertexValue type contains all the
> information that a supernode has to know: the list of nodes grouped
> into this supernode, their current PageRank, their in-neighbours, etc.
> We run this algorithm on a cluster of some 40-50 machines with an input
> graph of about 1 million nodes. We always get the error that one
> particular machine (always the same one) runs out of memory at the
> vertex state update. The error message is as follows.

> Error: The program execution failed: java.lang.RuntimeException:
> Memory ran out. Compaction failed. numPartitions: 32 minPartition: 1
> maxPartition: 155 number of overflow segments: 0 bucketSize: 178
> Overall memory: 32604160 Partition memory: 24248320 Message: null
>     at hu.sztaki.ilab.cumulonimbus.custom_pagerank_spargel.SuperNodeRankUpdater.updateVertex(SuperNodeRankUpdater.java:71)
>     at hu.sztaki.ilab.cumulonimbus.custom_pagerank_spargel.SuperNodeRankUpdater.updateVertex(SuperNodeRankUpdater.java:15)
>     at org.apache.flink.spargel.java.VertexCentricIteration$VertexUpdateUdf.coGroup(VertexCentricIteration.java:430)
>     at org.apache.flink.runtime.operators.CoGroupWithSolutionSetSecondDriver.run(CoGroupWithSolutionSetSecondDriver.java:141)
>     at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:510)
>     at org.apache.flink.runtime.iterative.task.AbstractIterativePactTask.run(AbstractIterativePactTask.java:137)
>     at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:109)
>     at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:375)
>     at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:265)
>     at java.lang.Thread.run(Thread.java:724)

> Line 71 in SuperNodeRankUpdater is a call to the function
> setNewVertexValue().
> Do you have some suggestions? Shall I try to put together an example?

> Thank you!

> Attila
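
For reference, the update function described in the report has roughly the following shape. SuperNodeValue and its fields are invented for illustration; only the Spargel callback structure (extending VertexUpdateFunction, calling setNewVertexValue) comes from the actual API:

import org.apache.flink.spargel.java.MessageIterator;
import org.apache.flink.spargel.java.VertexUpdateFunction;

public class SuperNodeRankUpdaterSketch extends
        VertexUpdateFunction<Long, SuperNodeRankUpdaterSketch.SuperNodeValue, Double> {

    // Hypothetical value type: holds the state of every node grouped
    // into one supernode (ids, current ranks, in-neighbours, ...).
    public static class SuperNodeValue implements java.io.Serializable {
        public long[] nodeIds = new long[0];
        public double[] ranks = new double[0];

        // Toy recomputation; the real logic distributes the incoming
        // rank mass over the nodes of the supernode.
        public void updateRanks(double incomingRankSum) {
            for (int i = 0; i < ranks.length; i++) {
                ranks[i] = 0.15 + 0.85 * (incomingRankSum / ranks.length);
            }
        }
    }

    @Override
    public void updateVertex(Long vertexKey, SuperNodeValue value,
            MessageIterator<Double> inMessages) {
        double sum = 0.0;
        while (inMessages.hasNext()) {
            sum += inMessages.next();
        }

        value.updateRanks(sum);

        // The call that fails in the report above: writing the (possibly
        // larger) value back into the managed solution-set memory.
        setNewVertexValue(value);
    }
}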