I also blog about how to do Single-Source Shortest Path here:
http://horicky.blogspot.com/2010/02/nosqlgraphdb.html
One MR algorithm is based on Dijkstra and the other is based on BFS. I think
the first one is more efficient than the second one.
Rgds,
Ricky
-----Original Message-----
From: Peng, Wei [mailto:Wei.Peng@xerox.com]
Sent: Monday, December 20, 2010 5:50 PM
To: commonuser@hadoop.apache.org
Subject: breadth-first search
I implemented an algorithm on Hadoop to calculate the average
separation length of a 25GB graph dataset.
The input format is V1(tab)V2 (where V2 is the friend of V1).
My goal is to first randomly select some seed nodes, and then, for
each seed node, calculate the shortest paths from it to all other
nodes in the graph.
To do this, I first run a simple Python script on a single machine to
select some random seed nodes.
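For context, the seed-selection step might look roughly like this (a minimal sketch; the function name and file handling are hypothetical, assuming the V1(tab)V2 edge format described below):

```python
import random

def pick_seed_nodes(edge_file, k, rng_seed=42):
    """Collect the node IDs seen in a V1<tab>V2 edge list and
    sample k of them uniformly at random as seed nodes."""
    nodes = set()
    with open(edge_file) as f:
        for line in f:
            v1, v2 = line.rstrip("\n").split("\t")
            nodes.add(v1)
            nodes.add(v2)
    random.seed(rng_seed)
    # Sort first so the sample is reproducible for a fixed seed.
    return random.sample(sorted(nodes), k)
```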
Then I run a Hadoop job to generate the adjacency list for each node as
the input for the second job.
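That adjacency-list job could be sketched in Hadoop Streaming style roughly like this (a minimal single-machine sketch with hypothetical function names; in real Streaming the mapper and reducer read stdin and write stdout, and the framework does the shuffle/sort between them):

```python
from itertools import groupby

def adjacency_mapper(lines):
    """Map each V1<tab>V2 edge line to a (node, friend) pair."""
    for line in lines:
        v1, v2 = line.rstrip("\n").split("\t")
        yield v1, v2

def adjacency_reducer(pairs):
    """Group sorted (node, friend) pairs into one adjacency-list
    record per node: node -> comma-separated friends."""
    for node, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        friends = ",".join(friend for _, friend in group)
        yield node, friends
```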
The second job takes the adjacency list as input and outputs the first
level of the breadth-first search: the nodes that are friends of the
seed node get distance 1. This output then becomes the input for the
next Hadoop job, and so on, until all nodes have been reached.
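That iterative step is essentially the standard parallel-BFS-in-MapReduce pattern; a minimal single-machine sketch of one round (function names hypothetical, records assumed to carry node, friend list, and current distance, with unreached nodes at infinity):

```python
from itertools import groupby

INF = float("inf")

def bfs_mapper(records):
    """For each (node, friends, distance) record: re-emit the node's
    own record so its adjacency list survives the round, and if the
    node has been reached, propose distance+1 to every friend."""
    for node, friends, dist in records:
        yield node, (friends, dist)
        if dist != INF:
            for friend in friends:
                yield friend, (None, dist + 1)

def bfs_reducer(pairs):
    """For each node, keep its adjacency list and the minimum of all
    proposed distances."""
    for node, group in groupby(sorted(pairs, key=lambda kv: kv[0]),
                               key=lambda kv: kv[0]):
        friends, best = [], INF
        for _, (fr, dist) in group:
            if fr is not None:
                friends = fr
            best = min(best, dist)
        yield node, friends, best
```

Running mapper then reducer repeatedly expands the frontier by one level per round; iteration stops once no distance changes.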
I generated a simulated graph for testing. This data has only 100 nodes.
Plain Python code can find the separation length within 1 second (100
seed nodes). However, Hadoop took almost 3 hours to do the same thing
(pseudo-distributed mode on one machine)!!
I wonder if there is a more efficient way to do breadth-first search in
Hadoop? It is very inefficient to output so many intermediate results.
In total there would be seedNodeNumber*levelNumber+1 jobs and
seedNodeNumber*levelNumber intermediate files. Why is Hadoop so slow?
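To put numbers on that formula: with, say, 100 seed nodes and a BFS that needs 6 levels (both numbers hypothetical), the job count is already in the hundreds, and each job pays its own startup and HDFS I/O cost:

```python
# Hypothetical numbers for illustration: 100 seed nodes, BFS depth 6.
seed_node_number = 100
level_number = 6

# One adjacency-list job, plus one BFS job per (seed, level) pair.
total_jobs = seed_node_number * level_number + 1
intermediate_files = seed_node_number * level_number

print(total_jobs, intermediate_files)  # 601 600
```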
Please help. Thanks!
Wei
