The trick is to amortize your computation over the whole set. So DFS
from a single node will always be faster on an in-memory graph, but
Hadoop is a good tool for computing all-pairs shortest paths in one
shot if you reframe the algorithm as a belief-propagation and
message-passing algorithm.
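
For concreteness, here is a minimal sketch of one round of that idea:
single-source BFS as one MapReduce pass over an adjacency-list file.
The class names and record format are mine, just for illustration; you
rerun the job until no distance changes.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class BfsRound {
  // Record format (assumed): nodeId <TAB> dist|n1,n2,...  dist -1 = unreached.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable off, Text line,
        OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] kv = line.toString().split("\t");
      String[] rec = kv[1].split("\\|", 2);
      long dist = Long.parseLong(rec[0]);
      String adj = rec.length > 1 ? rec[1] : "";
      out.collect(new Text(kv[0]), new Text("N|" + dist + "|" + adj));
      if (dist >= 0 && adj.length() > 0) {
        // the "message passing": send a candidate distance to each neighbor
        for (String nbr : adj.split(",")) {
          out.collect(new Text(nbr), new Text("D|" + (dist + 1)));
        }
      }
    }
  }
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text node, Iterator<Text> vals,
        OutputCollector<Text, Text> out, Reporter r) throws IOException {
      long best = -1;
      String adj = "";
      while (vals.hasNext()) { // fold the messages: keep the minimum distance
        String[] p = vals.next().toString().split("\\|", 3);
        long d = Long.parseLong(p[1]);
        if (p[0].equals("N") && p.length > 2) adj = p[2];
        if (d >= 0 && (best < 0 || d < best)) best = d;
      }
      out.collect(node, new Text(best + "|" + adj)); // same format, next round
    }
  }
}

Each round pushes the frontier out one hop, so you trade depth-first
order for O(diameter) full scans of the edge list -- that's the
amortization.
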
A lot of the time, the computation still explodes into n^2 or worse, so
you need to use a binning or blocking algorithm, like the one described
here: http://www.youtube.com/watch?v=1ZDybXl212Q
In the case of graphs, a blocking function would be to find overlapping
strongly connected subgraphs where each subgraph fits in a reasonable
amount of memory. Then within each block, you do your computation and
you pass a summary of that computation to adjacent blocks, which gets
factored into the next round of computation.
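
As a rough skeleton of that loop (Block, Summary, and computeBlock are
placeholders for your own types, not a real API):

import java.util.*;

public class BlockedGraphPass {
  static class Summary { /* condensed results a block exports */ }
  static class Block {
    String id;
    List<String> adjacentBlocks = new ArrayList<String>();
  }

  /** One round: run each block against the summaries it received, then
   *  route the new summaries to adjacent blocks for the next round. */
  static Map<String, List<Summary>> round(List<Block> blocks,
      Map<String, List<Summary>> inbox) {
    Map<String, List<Summary>> outbox = new HashMap<String, List<Summary>>();
    for (Block b : blocks) { // on Hadoop: one reduce call per block id
      List<Summary> received = inbox.containsKey(b.id)
          ? inbox.get(b.id) : Collections.<Summary>emptyList();
      Summary result = computeBlock(b, received); // fits in RAM by construction
      for (String nbr : b.adjacentBlocks) {
        if (!outbox.containsKey(nbr)) {
          outbox.put(nbr, new ArrayList<Summary>());
        }
        outbox.get(nbr).add(result);
      }
    }
    return outbox; // becomes the next round's inbox; iterate to convergence
  }

  static Summary computeBlock(Block b, List<Summary> incoming) {
    return new Summary(); // placeholder for the real in-memory algorithm
  }
}
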
When we hooked up a Very Big Graph to our Hadoop cluster, we ran into
a lot of scaling problems, and they went away when we started
optimizing for streaming performance.
Colin
Bhupesh Bansal wrote:
> Can you elaborate here,
>
> Let's say I want to implement a DFS on my graph. I am not able to picture
> implementing it with the graph in pieces without putting a depth bound
> (3-4) on it. Let's say we have 200M (4GB) edges to start with.
>
> Best
> Bhupesh
>
>
>
> On 10/16/08 3:01 PM, "Owen O'Malley" <omalley@apache.org> wrote:
>
>
>> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
>>
>>
>>> We at LinkedIn are trying to run some large graph analysis problems on
>>> Hadoop. The fastest way to run them would be to keep a copy of the whole
>>> graph in RAM at all mappers (graph size is about 8G in RAM). We have a
>>> cluster of 8-core machines with 8G on each.
>>>
>> The best way to deal with it is *not* to load the entire graph in one
>> process. In the WebMap at Yahoo, we have a graph of the web that has
>> roughly 1 trillion links and 100 billion nodes. See http://tinyurl.com/4fgok6
>> . To invert the links, you process the graph in pieces and re-sort
>> based on the target. You'll get much better performance and scale to
>> almost any size.
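>>
>> A minimal sketch of that inversion step against the classic mapred
>> API (the flat source/target edge format is an assumption):
>>
>> import java.io.IOException;
>> import java.util.Iterator;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.*;
>>
>> public class InvertLinks {
>>   public static class Map extends MapReduceBase
>>       implements Mapper<LongWritable, Text, Text, Text> {
>>     public void map(LongWritable off, Text line,
>>         OutputCollector<Text, Text> out, Reporter r) throws IOException {
>>       String[] edge = line.toString().split("\t"); // source <TAB> target
>>       out.collect(new Text(edge[1]), new Text(edge[0])); // key by target
>>     }
>>   }
>>   public static class Reduce extends MapReduceBase
>>       implements Reducer<Text, Text, Text, Text> {
>>     public void reduce(Text target, Iterator<Text> sources,
>>         OutputCollector<Text, Text> out, Reporter r) throws IOException {
>>       StringBuilder in = new StringBuilder();
>>       while (sources.hasNext()) { // the sort/shuffle groups the in-links
>>         if (in.length() > 0) in.append(',');
>>         in.append(sources.next().toString());
>>       }
>>       out.collect(target, new Text(in.toString())); // target -> in-links
>>     }
>>   }
>> }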
>>
>>
>>> What is the best way of doing that? Is there a way for multiple
>>> mappers on the same machine to access a RAM cache? I read about the
>>> Hadoop distributed cache; it looks like it copies the file (HDFS /
>>> HTTP) locally onto the slaves, but not necessarily into RAM?
>>>
>> You could mmap the file from the distributed cache using a
>> MappedByteBuffer. Then there will be one copy shared between the JVMs...
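>>
>> Something like this (a sketch: the file would be one of the paths
>> returned by DistributedCache.getLocalCacheFiles(), and an 8G file has
>> to be mapped in chunks, since one MappedByteBuffer tops out at
>> Integer.MAX_VALUE bytes):
>>
>> import java.io.File;
>> import java.io.IOException;
>> import java.io.RandomAccessFile;
>> import java.nio.MappedByteBuffer;
>> import java.nio.channels.FileChannel;
>>
>> public class CacheMmap {
>>   // Map the localized cache file read-only. The pages live in the OS
>>   // page cache, so every JVM on the node shares one physical copy.
>>   public static MappedByteBuffer[] map(File local) throws IOException {
>>     RandomAccessFile raf = new RandomAccessFile(local, "r");
>>     try {
>>       FileChannel ch = raf.getChannel();
>>       long size = ch.size();
>>       long chunk = Integer.MAX_VALUE; // one buffer maps < 2GB
>>       MappedByteBuffer[] bufs =
>>           new MappedByteBuffer[(int) ((size + chunk - 1) / chunk)];
>>       for (int i = 0; i < bufs.length; i++) {
>>         long off = i * chunk;
>>         bufs[i] = ch.map(FileChannel.MapMode.READ_ONLY, off,
>>             Math.min(chunk, size - off));
>>       }
>>       return bufs; // the mappings stay valid after the file is closed
>>     } finally {
>>       raf.close();
>>     }
>>   }
>> }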
>>
>>  Owen
>>
>
>
