Mailing-List: contact user-help@giraph.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@giraph.apache.org
Received-SPF: pass (athena.apache.org: domain of pavanka@outlook.com
 designates 65.54.190.226 as permitted sender)
Message-ID: <BAY176-W4968D4279FAF4C79F5073BE860@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_9b7c79bc-8f01-4f9b-a21d-8367888f49b1_"
From: Pavan Kumar A <pavanka@outlook.com>
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: RE: Graph partitioning and data locality
Date: Tue, 4 Nov 2014 20:58:16 +0530
Importance: Normal
In-Reply-To: 
 <CAFJOoJd9Ty7StcUe2dS1PQo2rz2hULBLtE=KxPsog8ydbKyqcA@mail.gmail.com>
References: 
 <545881F0.8080605@gmx.net>,<CAFJOoJd9Ty7StcUe2dS1PQo2rz2hULBLtE=KxPsog8ydbKyqcA@mail.gmail.com>
MIME-Version: 1.0

--_9b7c79bc-8f01-4f9b-a21d-8367888f49b1_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

You can also look at https://issues.apache.org/jira/browse/GIRAPH-908
which solves the case where you have a partition map and would like the
graph to be partitioned that way after loading the input. It does not,
however, solve the "do not shuffle data" part.
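
For illustration only, a partition map can be pictured as a lookup from
vertex ID to partition ID that is consulted when deciding where a loaded
vertex belongs. The sketch below is plain Java and is not the code from
GIRAPH-908; the class name PartitionMapSketch, the use of long vertex
IDs, and the hash fallback for unmapped vertices are all assumptions
made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a vertex-to-partition map; not Giraph code. */
final class PartitionMapSketch {
    private final Map<Long, Integer> assigned = new HashMap<>();
    private final int partitionCount;

    PartitionMapSketch(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    /** Records that a vertex should live in the given partition. */
    void assign(long vertexId, int partition) {
        if (partition < 0 || partition >= partitionCount) {
            throw new IllegalArgumentException("partition out of range");
        }
        assigned.put(vertexId, partition);
    }

    /** Uses the map when it has an entry; otherwise falls back to hashing (an assumption). */
    int partitionFor(long vertexId) {
        Integer mapped = assigned.get(vertexId);
        if (mapped != null) {
            return mapped;
        }
        // floorMod keeps the result in [0, partitionCount) for negative hash codes too.
        return Math.floorMod(Long.hashCode(vertexId), partitionCount);
    }

    public static void main(String[] args) {
        PartitionMapSketch map = new PartitionMapSketch(4);
        map.assign(10L, 3);
        System.out.println(map.partitionFor(10L)); // mapped vertex
        System.out.println(map.partitionFor(9L));  // unmapped vertex, hashed
    }
}
```

Note that a lookup like this only decides the target partition. A vertex
read by a worker that does not own that partition would still have to be
sent over the network, which is the shuffle mentioned above.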

From: claudio.martella@gmail.com
Date: Tue, 4 Nov 2014 16:20:21 +0100
Subject: Re: Graph partitioning and data locality
To: user@giraph.apache.org

Hi,

answers are inline.

On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns
<martin.junghanns@gmx.net> wrote:
> Hi group,
>
> I have a question concerning the graph partitioning step. If I
> understood the code correctly, the graph is distributed to n
> partitions by using vertexID.hashCode() & n. I have two questions
> concerning that step.
>
> 1) Is the whole graph loaded and partitioned only by the Master? This
> would mean the whole data has to be moved to that Master map job and
> then moved to the physical node that the specific worker for the
> partition runs on. As this sounds like a huge overhead, I further
> inspected the code:
> I saw that there is also a WorkerGraphPartitioner, and I assume it
> calls the partitioning method on its local data (let's say its local
> HDFS blocks) and, if the resulting partition for a vertex does not
> belong to that worker, the data gets moved to the owning worker, which
> reduces the overhead. Is this assumption correct?

That is correct. Workers forward vertex data to the worker that is
responsible for that vertex, via hash partitioning by default, so the
master is not involved.
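
A minimal sketch of the worker-side decision described above, in plain
Java with no Giraph dependency. The names HashPartitionSketch,
partitionFor, workerFor, and idsToForward are hypothetical, the
round-robin mapping from partitions to workers is an assumption, and
Math.floorMod stands in for whatever expression Giraph actually applies
to the hash code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of hash partitioning done by each worker; not Giraph code. */
final class HashPartitionSketch {

    /** Maps a vertex ID to one of partitionCount partitions by its hash code. */
    static int partitionFor(Object vertexId, int partitionCount) {
        // floorMod keeps the result in [0, partitionCount) even for negative hash codes.
        return Math.floorMod(vertexId.hashCode(), partitionCount);
    }

    /** Maps a partition to its owning worker; round-robin is an assumption. */
    static int workerFor(int partition, int workerCount) {
        return partition % workerCount;
    }

    /** Returns the IDs read by worker `self` that belong to some other worker. */
    static List<Long> idsToForward(List<Long> localIds, int self,
                                   int partitionCount, int workerCount) {
        List<Long> forward = new ArrayList<>();
        for (Long id : localIds) {
            int owner = workerFor(partitionFor(id, partitionCount), workerCount);
            if (owner != self) {
                forward.add(id); // this vertex would be sent to `owner`
            }
        }
        return forward;
    }

    public static void main(String[] args) {
        // Worker 0 reads a split with six vertices; 4 partitions, 2 workers.
        List<Long> split = Arrays.asList(0L, 1L, 2L, 3L, 4L, 5L);
        System.out.println(idsToForward(split, 0, 4, 2));
    }
}
```

Each worker only needs the partition count and the partition-to-worker
assignment to make this decision, which is why no data has to pass
through the master.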

> 2) Let's say the graph is already partitioned in the file system, e.g.
> blocks on physical nodes contain logically connected graph nodes. Is it
> possible to just read the data as it is and skip the partitioning step?
> In that case I currently assume that the vertexID should contain the
> partitionID and the custom partitioning would be an identity function
> (instead of hashing or range).

In principle you can. You would need to organize the splits so that they
contain all the data for each particular worker, and then assign the
relevant splits to the corresponding worker.
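
As a rough sketch of the identity-style partitioning asked about above,
assume long vertex IDs whose upper 16 bits hold the partition ID and
whose lower 48 bits hold a per-partition local ID. The bit layout and
the class EncodedPartitionSketch are made up for illustration and are
not part of Giraph.

```java
/** Hypothetical sketch: vertex IDs carry their partition ID in the upper 16 bits. */
final class EncodedPartitionSketch {
    private static final int LOCAL_BITS = 48;
    private static final long LOCAL_MASK = (1L << LOCAL_BITS) - 1;
    private static final int MAX_PARTITION = 0xFFFF;

    /** Packs a partition ID and a local ID into one long vertex ID. */
    static long encode(int partitionId, long localId) {
        if (partitionId < 0 || partitionId > MAX_PARTITION) {
            throw new IllegalArgumentException("partitionId out of range");
        }
        if (localId < 0 || localId > LOCAL_MASK) {
            throw new IllegalArgumentException("localId out of range");
        }
        return ((long) partitionId << LOCAL_BITS) | localId;
    }

    /** The "identity" partitioner: read the partition straight from the ID. */
    static int partitionOf(long vertexId) {
        // Unsigned shift so partition IDs with the top bit set still decode correctly.
        return (int) (vertexId >>> LOCAL_BITS);
    }

    /** Recovers the per-partition local ID. */
    static long localIdOf(long vertexId) {
        return vertexId & LOCAL_MASK;
    }

    public static void main(String[] args) {
        long id = encode(3, 42L);
        System.out.println(partitionOf(id) + " " + localIdOf(id));
    }
}
```

With IDs laid out like this, the partitioning step reduces to reading a
field of the ID. Data still stays local only if each worker also reads
the splits that hold its own partitions, as described above.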

> Thanks for your time and help!
>
> Cheers,
> Martin

--=20
    Claudio Martella

--_9b7c79bc-8f01-4f9b-a21d-8367888f49b1_--