From: Claudio Martella <claudio.martella@gmail.com>
Date: Thu, 2 May 2013 08:09:00 +0200
Subject: Re: Can I use Giraph on an application with two maps but no reduce?
To: user@giraph.apache.org

The question is: do you have 100GB of main memory? How big are your messages going to be? How dense is the graph?

Although we have out-of-core facilities, this does not look to me like a typical graph algorithm, and in particular not one that would take particular advantage of Giraph compared to MapReduce. This is because it has a low number of iterations (two), and hence, especially if you have memory constraints, it could work out pretty easily with MapReduce. It actually looks to me like a single map/reduce job, where the reducer could do the second iteration, but I could be missing some details.

As far as load balancing is concerned, I guess it depends on your degree distribution. Having a "random" distribution of vertices through hash-partitioning should back you up, but if you have a bunch of nodes that are much more active, you could have some stragglers.

On Thu, May 2, 2013 at 2:12 AM, Hadoop Explorer <hadoopexplorer@outlook.com> wrote:

> I have an application that evaluates a graph using this algorithm:
>
> - use a parallel for loop to evaluate all nodes in a graph (to evaluate a
> node, an image is read, and then the result of this node is calculated)
>
> - use a second parallel for loop to evaluate all edges in the graph. The
> function would take in the results from both nodes of the edge, and then
> calculate the answer for the edge
>
> The final result will consist of the calculated results of each edge.
> So each node and each edge is essentially a job, and in this case, an edge
> is more like a job than a message.
>
> As you can see, the above algorithm would employ two map functions, but no
> reduce function. The total data size can be very large (say 100GB). Also,
> the workload of each node and each edge is highly irregular, and thus load
> balancing mechanisms are essential.
>
> In this case, will Giraph suit this application? If so, what will my
> program look like? And will Giraph be able to strike a balance between good
> load balancing of the second map function and minimizing data transfer of
> the results from the first map function?

--
Claudio Martella
claudio.martella@gmail.com
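[Editor's note: a minimal plain-Java sketch of the two phases described in the question, in the map/reduce-style formulation suggested in the reply (phase 1 maps over nodes, phase 2 joins the two endpoint results per edge). `evalNode` and `evalEdge` are hypothetical stand-ins for the image-based node computation and the edge function; this is not Giraph API code.]

```java
import java.util.HashMap;
import java.util.Map;

// Two-phase graph evaluation: per-node values first, then per-edge values
// computed from both endpoints' results.
public class TwoPhaseGraphEval {
    // Phase 1: per-node evaluation (stand-in for the image-based computation).
    static long evalNode(int node) { return (long) node * node; }

    // Phase 2: per-edge evaluation, combining both endpoint results.
    static long evalEdge(long uVal, long vVal) { return uVal + vVal; }

    public static void main(String[] args) {
        int[][] edges = {{1, 2}, {2, 3}, {1, 3}};

        // Phase 1: "map" over nodes, computing each node's value once.
        Map<Integer, Long> nodeVals = new HashMap<>();
        for (int[] e : edges)
            for (int n : e)
                nodeVals.computeIfAbsent(n, TwoPhaseGraphEval::evalNode);

        // Phase 2: "map" over edges, joining the two endpoint results
        // (this is the step a reducer could perform in a MapReduce job).
        for (int[] e : edges)
            System.out.println(e[0] + "-" + e[1] + ": "
                    + evalEdge(nodeVals.get(e[0]), nodeVals.get(e[1])));
        // prints: 1-2: 5, 2-3: 13, 1-3: 10
    }
}
```

In MapReduce terms, phase 1 is the map; emitting each node value keyed by its incident edges and combining the two values per edge is what the reducer would do, which is why a single map/reduce job can cover both "maps" from the question.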