giraph-user mailing list archives

From Pavan Kumar A <pava...@outlook.com>
Subject RE: Graph partitioning and data locality
Date Tue, 04 Nov 2014 15:28:16 GMT
You can also look at https://issues.apache.org/jira/browse/GIRAPH-908, which solves the case
where you have a partition map and would like the graph to be partitioned that way after loading
the input. It does not, however, solve the "do not shuffle data" part.
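For illustration only, here is a minimal standalone sketch (not the GIRAPH-908 patch itself; all
class and method names below are made up) of what mapping-driven assignment looks like: vertex ids
are looked up in a precomputed partition map, and anything missing falls back to hashing.

import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: assign vertices using a precomputed partition map. */
public class MappedPartitionAssigner {
  private final Map<Long, Integer> partitionMap; // vertexId -> partitionId, loaded beforehand
  private final int numPartitions;

  public MappedPartitionAssigner(Map<Long, Integer> partitionMap, int numPartitions) {
    this.partitionMap = partitionMap;
    this.numPartitions = numPartitions;
  }

  /** Use the precomputed mapping; fall back to hashing for unmapped vertices. */
  public int getPartition(long vertexId) {
    Integer p = partitionMap.get(vertexId);
    return p != null ? p : Math.abs(Long.hashCode(vertexId) % numPartitions);
  }

  public static void main(String[] args) {
    Map<Long, Integer> map = new HashMap<>();
    map.put(1L, 0);
    map.put(2L, 3);
    MappedPartitionAssigner assigner = new MappedPartitionAssigner(map, 4);
    System.out.println(assigner.getPartition(2L)); // 3, taken from the map
    System.out.println(assigner.getPartition(7L)); // unmapped, falls back to hashing
  }
}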

From: claudio.martella@gmail.com
Date: Tue, 4 Nov 2014 16:20:21 +0100
Subject: Re: Graph partitioning and data locality
To: user@giraph.apache.org

Hi,
answers are inline.
On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns <martin.junghanns@gmx.net> wrote:
Hi group,



I have a question concerning the graph partitioning step. If I understood the code correctly,
the graph is distributed to n partitions using vertexID.hashCode() % n. I have two questions
concerning that step.



1) Is the whole graph loaded and partitioned only by the master? That would mean all the
data has to be moved to the master's map task and then moved again to the physical node where
the specific worker for the partition runs. As this sounds like a huge overhead, I inspected
the code further:

I saw that there is also a WorkerGraphPartitioner, and I assume it calls the partitioning method
on its local data (let's say its local HDFS blocks); if the resulting partition for a vertex
does not belong to that worker, the data gets moved to the owning worker, which reduces the
overhead. Is this assumption correct?

That is correct: workers forward vertex data to the worker that is responsible for
that vertex via hash partitioning (by default), meaning that the master is not involved.
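As a rough standalone illustration of that default behaviour (this is not Giraph's actual
partitioner class; the names below are assumptions), the assignment each worker computes boils
down to two modulo steps: vertex id -> partition, then partition -> owning worker.

/** Standalone sketch of default-style hash partitioning: vertex -> partition -> worker. */
public class HashAssignmentSketch {
  private final int partitionCount;
  private final int workerCount;

  public HashAssignmentSketch(int partitionCount, int workerCount) {
    this.partitionCount = partitionCount;
    this.workerCount = workerCount;
  }

  /** Which partition a vertex id lands in (hash modulo partition count). */
  public int partitionFor(Object vertexId) {
    return Math.abs(vertexId.hashCode() % partitionCount);
  }

  /** Which worker owns a given partition (round-robin over workers). */
  public int workerFor(int partition) {
    return partition % workerCount;
  }

  public static void main(String[] args) {
    HashAssignmentSketch sketch = new HashAssignmentSketch(8, 4);
    long vertexId = 42L;
    int partition = sketch.partitionFor(vertexId);
    int worker = sketch.workerFor(partition);
    // A worker that loads vertex 42 from its local split would send it to `worker`
    // unless that happens to be itself; the master never sees the vertex data.
    System.out.println("vertex " + vertexId + " -> partition " + partition + " -> worker " + worker);
  }
}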


2) Let's say the graph is already partitioned in the file system, e.g. blocks on physical
nodes contain logically connected graph vertices. Is it possible to just read the data as it is
and skip the partitioning step? In that case I currently assume that the vertexID should
contain the partitionID and the custom partitioning would be an identity function (instead of
hashing or ranges).

In principle you can. You would need to organize the input splits so that each split contains all
the data for one particular worker, and then assign the relevant splits to the corresponding worker.
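One way to picture the identity-style scheme Martin describes (purely a hypothetical sketch; the
bit layout of the id is an assumption, not something Giraph prescribes): encode the partition id
in the upper bits of the vertex id and have the partitioner simply read it back, so a vertex never
needs to move as long as the matching splits are assigned to the matching workers.

/** Hypothetical sketch: vertex ids that carry their partition id in the upper 16 bits. */
public class IdentityPartitionSketch {

  /** Build a vertex id whose top 16 bits are the partition and lower 48 bits the local id. */
  public static long encode(int partitionId, long localId) {
    return ((long) partitionId << 48) | (localId & 0xFFFFFFFFFFFFL);
  }

  /** "Identity" partitioning: just read the partition id back out of the vertex id. */
  public static int partitionOf(long vertexId) {
    return (int) (vertexId >>> 48);
  }

  public static void main(String[] args) {
    long id = encode(3, 1234L);
    // Prints 3: the partition is fixed by the id itself, so no shuffling is needed
    // when the input splits are handed to the workers that own those partitions.
    System.out.println(partitionOf(id));
  }
}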


Thanks for your time and help!



Cheers,

Martin



-- 
    Claudio Martella