giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avery Ching <>
Subject Re: Question about range partitioner and data locality
Date Tue, 29 May 2012 22:45:40 GMT

Solving this issue will be a nice contribution.  Here's one idea (you 
may have a better one):

First off, you could implement 
MasterGraphPartitioner#createInitialPartitionOwners() such that the 
partitions align with your input splits.

After that, we need a way to perhaps specify a locality to a particular 
worker.  You could take a part of the 
BspServiceWorker#reserveInputSplit() method to be a user specified class 
and method to do the reservation, given some information (i.e. all the 
input splits).  Then you could implement a reservation algorithm that 
perhaps tried to load on the splits that met the ranges it had.  This a 
little hand-wavey as the interfaces need to be figured out.

What do you think?


On 5/25/12 5:00 PM, Yuanyuan Tian wrote:
> Avery,
> I think I didn't make myself very clear in the first email. I have 
> already wrote a range based partitioner, and it works. But exactly as 
> you said, the vertices shipped is pretty much the same as the hash 
> partitioner. Actually the vertices loading time is a bit slower than 
> hash partitioner, because it takes a bit more time to check for the 
> partition id for each vertex. I did observe the reduction of the # of 
> messages in the giraph job.
> Now, what I want to do is to reduce the loading time. I have 
> preprocessed the input graph so that data is divided into n files (n 
> is the number of workers I want to use for my giraph job later), each 
> file contains a few range-based partitions.  I know the partition 
> ranges and which file each partition belongs to before I run my giraph 
> job. I want a new partitioner so that each worker will read local data 
> without checking for partitionID and use the ranges of the local data 
> to register as the partitions it is responsible for. This way the 
> loading phase doesn't need to check for partitionid for each vertex 
> and it doesn't need to ship vertices to other workers either. I 
> understand this will be a special paritioner only used when the input 
> data is very well organized. My question is how I can achieve this.
> Yuanyuan
> From: Avery Ching <>
> To:
> Cc: Yuanyuan Tian/Almaden/IBM@IBMUS
> Date: 05/25/2012 11:10 AM
> Subject: Re: Question about range partitioner and data locality
> ------------------------------------------------------------------------
> Writing a range based partitioner is for potentially reducing the 
> number of messages between workers (i.e. reverse lexical ordering of 
> urls for page rank).  Without changes to the input splits loading, the 
> average number vertices shipped during the input superstep will be the 
> same as the using the hash partitioner.  Is this what you are trying 
> to achieve?
> Avery
> On 5/25/12 10:57 AM, Yuanyuan Tian wrote:
> I am not suggesting to change the current range partitioner, as it is 
> designed for a general case. I want to write a special partitioner 
> based on the existing range partitioner to achieve what I want to do 
> in this special situation, but I don't know how.
> Yuanyuan
> -----Avery Ching _<>_ 
> <>wrote: -----
> To: _user@giraph.apache.org_ <>
> From: Avery Ching _<>_ <>
> Date: 05/24/2012 11:59PM
> Subject: Re: Question about range partitioner and data locality
> You are definitely right that the old version of Giraph supported 
> ranges pretty well for loading, but could not support hash based 
> distribution (much better for memory distribution across workers).  It 
> also made a lot of assumptions (the data within each split was in a 
> unique range and sorted).
> Unless we make these type of assumptions, it would be pretty hard to 
> do.  One way might be to have all the workers examine each input 
> split, and each input split would provide on information as to its 
> range.  If the worker matches that range, it would attempt to load 
> some or all of the vertices in that split.  Otherwise, it would try 
> the next split.
> Any other ideas?
> Avery
> On 5/23/12 5:36 PM, Yuanyuan Tian wrote:
> Hi,
> I want to use better partitions of input graph for my algorithm 
> running on Giraph. So, I partitioned my input graph and re-labeled the 
> vertex ids so that vertex ids of the same partition are in a 
> consecutive range. I also reorganized the input file so that the 
> vertices in the same range are together. I used the range partitioner 
> for the Giraph job to utilize the better partitions. However, the 
> vertex loader still looks for the partition id of each vertex and ship 
> it to the worker that owns the partition. On the other hand, I have 
> already prepared my data in a nice way, in the ideal case, I can just 
> keep all the vertices of an inputsplit local to the corresponding 
> worker. Is there an easy way to do this? I know that in the very old 
> version of giraph, giraph doesn't have a partitioner. The users have 
> to prepare the partitions. I essentially want to do a similar thing in 
> the current version of giraph. Please give me a pointer or two on how 
> to do this.
> Thanks,
> Yuanyuan

View raw message