giraph-user mailing list archives

From Martin Junghanns <>
Subject Re: Graph partitioning and data locality
Date Wed, 05 Nov 2014 10:37:31 GMT
Thanks to you both and to Lukas, who answered me directly. The mentioned issue 
908 helps a lot, and I will look further into this and into the physical layout.

On 04.11.2014 16:28, Pavan Kumar A wrote:
> You can also look at
> which solves the case where you have a partition map and would like the 
> graph to be partitioned that way after loading the input. It does not, 
> however, solve the "do not shuffle the data" part.
> ------------------------------------------------------------------------
> From:
> Date: Tue, 4 Nov 2014 16:20:21 +0100
> Subject: Re: Graph partitioning and data locality
> To:
> Hi,
> answers are inline.
> On Tue, Nov 4, 2014 at 8:36 AM, Martin Junghanns 
> < <>> wrote:
>     Hi group,
>     I got a question concerning the graph partitioning step. If I
>     understood the code correctly, the graph is distributed to n
>     partitions by using vertexID.hashCode() & n. I got two questions
>     concerning that step.
>     1) Is the whole graph loaded and partitioned only by the master?
>     That would mean all the data has to be moved to the master's map
>     task and then on to the physical node where the worker for each
>     partition runs. As this sounds like a huge overhead, I
>     inspected the code further:
>     I saw that there is also a WorkerGraphPartitioner, and I assume it
>     calls the partitioning method on its local data (say, its
>     local HDFS blocks); if the resulting partition for a vertex is
>     not the worker itself, the data gets moved to the owning worker,
>     which reduces the overhead. Is this assumption correct?
> That is correct: workers forward vertex data to the worker that is 
> responsible for that vertex via hash partitioning (by default), 
> so the master is not involved.
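The default hash-based assignment described above can be sketched as follows. This is an illustrative standalone class, not Giraph's actual partitioner implementation; the class and method names are assumptions, but the modulo-of-hash scheme matches what the thread describes.

```java
// Sketch of default hash-based partition assignment for long vertex ids.
// Class and method names are illustrative, not Giraph's real API.
public final class HashPartitioning {

    private HashPartitioning() {}

    // Map a vertex id to one of numPartitions partitions.
    // Math.abs guards against negative hash codes.
    public static int partitionFor(long vertexId, int numPartitions) {
        return Math.abs(Long.hashCode(vertexId) % numPartitions);
    }
}
```

Because every worker computes the same function over the same id, any worker can determine the owner of any vertex locally and forward data directly, without involving the master.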
>     2) Let's say the graph is already partitioned in the file system,
>     e.g. blocks on physical nodes contain logically connected graph
>     vertices. Is it possible to read the data as it is and skip the
>     partitioning step? In that case I currently assume that the
>     vertexID should contain the partitionID, and the custom
>     partitioning would be an identity function (instead of hashing
>     or ranges).
> In principle you can. You would need to organize the input splits so 
> that each contains all the data for a particular worker, and then 
> assign the relevant splits to the corresponding worker.
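The identity-style scheme Martin proposes, where the vertex id itself carries the partition id, could look roughly like this. This is a hypothetical sketch, not Giraph code: the bit layout (16 partition bits in the top of a 64-bit id) and all names are assumptions.

```java
// Hypothetical sketch: encode the partition id in the upper 16 bits of
// a 64-bit vertex id, so "partitioning" is a bit extraction rather than
// a hash. Not Giraph's API; names and layout are illustrative only.
public final class EncodedIdPartitioning {

    private static final int PARTITION_BITS = 16;
    private static final int LOCAL_BITS = 64 - PARTITION_BITS;
    private static final long LOCAL_MASK = (1L << LOCAL_BITS) - 1;

    private EncodedIdPartitioning() {}

    // Build a vertex id from a partition id and a partition-local id.
    public static long encode(int partitionId, long localId) {
        return ((long) partitionId << LOCAL_BITS) | (localId & LOCAL_MASK);
    }

    // Identity-style partitioner: read the partition id back out of the
    // vertex id instead of hashing it.
    public static int partitionOf(long vertexId) {
        return (int) (vertexId >>> LOCAL_BITS);
    }
}
```

With ids laid out this way, vertices loaded from a split that was written for partition p already map back to p, so no data needs to be shuffled after loading, provided each split is assigned to the worker owning that partition, as Claudio notes.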
>     Thanks for your time and help!
>     Cheers,
>     Martin
> -- 
>    Claudio Martella
