hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ricky Ho <rickyphyl...@yahoo.com>
Subject RE: breadth-first search
Date Tue, 21 Dec 2010 03:00:53 GMT
 I also blog about how to to Single Source Shortest Path here at 

One MR algorithm is based on Dijkstra and the other is based on BFS.  I think 
the first one is more efficient than the second one.


-----Original Message-----
From: Peng, Wei [mailto:Wei.Peng@xerox.com] 
Sent: Monday, December 20, 2010 5:50 PM
To: common-user@hadoop.apache.org
Subject: breadth-first search 
 I implemented an algorithm to run hadoop on a 25GB graph data to
calculate its average separation length.
The input format is V1(tab)V2 (where V2 is the friend of V1).
My purpose is to first randomly select some seed nodes, and then for
each node, calculate the shortest paths from this node to all other
nodes on the graph.
To do this, I first run a simple python code in a single machine to get
some random seed nodes.
Then I run a hadoop job to generate adjacent list for each node as the
input for the second job.
The second job takes the adjacent list input and output the first level
breadth-first search result. The nodes which are the friends of the seed
node have distance 1. Then this output is the input for the next hadoop
job so on so forth, until all the nodes are reached.
I generated a simulated graph for testing. This data has only 100 nodes.
Normal python code can find the separation length within 1 second (100
seed nodes). However, the hadoop took almost 3 hours to do that
(pseudo-distributed mode on one machine)!!
I wonder if there is a more efficient way to do breadth-first search in
hadoop? It is very inefficient to output so many intermediate results.
Totally there would be seedNodeNumber*levelNumber+1 jobs,
seedNodeNumber*levelNumber intermediate files. Why is hadoop so slow? 
Please help.  Thanks!


View raw message