hadoop-common-user mailing list archives

From Colin Evans <co...@metaweb.com>
Subject Re: Distributed cache Design
Date Thu, 16 Oct 2008 23:05:46 GMT
The trick is to amortize your computation over the whole set.  DFS
on a single node will always be faster for an in-memory graph, but
Hadoop is a good tool for computing all-pairs shortest paths in one shot
if you re-frame the algorithm as a belief-propagation and
message-passing algorithm.
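Concretely, one round of that message-passing formulation can be sketched in plain Java (the "map"/"reduce" split below is illustrative, not actual Hadoop API code): each reached node sends its tentative distance plus the edge weight to its neighbors, and each node keeps the minimum it hears.

```java
import java.util.*;

// Illustrative sketch of one message-passing round for single-source
// shortest paths, in the shape that maps onto a Hadoop iteration:
// the "map" phase sends each node's tentative distance to its
// neighbors, the "reduce" phase keeps the minimum per node.
public class ShortestPathRound {

    // One iteration: returns updated tentative distances.
    static Map<String, Integer> round(Map<String, Map<String, Integer>> adj,
                                      Map<String, Integer> dist) {
        // "Map" phase: emit (neighbor, dist[node] + edgeWeight) messages.
        List<Map.Entry<String, Integer>> messages = new ArrayList<>();
        for (var node : dist.entrySet()) {
            if (node.getValue() == Integer.MAX_VALUE) continue;  // unreached
            for (var edge : adj.getOrDefault(node.getKey(), Map.of()).entrySet()) {
                messages.add(Map.entry(edge.getKey(),
                                       node.getValue() + edge.getValue()));
            }
        }
        // "Reduce" phase: take the min over all messages and the old distance.
        Map<String, Integer> next = new HashMap<>(dist);
        for (var m : messages) {
            next.merge(m.getKey(), m.getValue(), Math::min);
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> adj = Map.of(
            "a", Map.of("b", 1, "c", 4),
            "b", Map.of("c", 2));
        Map<String, Integer> dist = new HashMap<>(
            Map.of("a", 0, "b", Integer.MAX_VALUE, "c", Integer.MAX_VALUE));
        // Iterate until distances stop changing (at most diameter rounds).
        Map<String, Integer> next = round(adj, dist);
        while (!next.equals(dist)) { dist = next; next = round(adj, dist); }
        System.out.println(dist.get("c"));  // prints 3 (a -> b -> c)
    }
}
```

On a cluster, each round would be one MapReduce job, with the distances re-read from the previous job's output.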

A lot of the time, the computation still explodes into n^2 or worse, so 
you need to use a binning or blocking algorithm, like the one described 
here:  http://www.youtube.com/watch?v=1ZDybXl212Q

In the case of graphs, a blocking function would find overlapping 
strongly connected subgraphs, each of which fits in a reasonable 
amount of memory.  Then within each block you do your computation and 
pass a summary of that computation to the adjacent blocks, which gets 
factored into the next computation.
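A loose sketch of that block-and-summarize pattern, with a per-block minimum value standing in for the summary (the block graph, the crossing cost, and the merge rule are all illustrative assumptions, not a specific API):

```java
import java.util.*;

// Illustrative block-and-summarize round: each block recomputes from
// its own summary plus the summaries received from adjacent blocks.
// Here the "summary" is a single minimum value per block and crossing
// a block boundary costs 1; real summaries would be richer.
public class BlockedCompute {

    static Map<String, Integer> round(Map<String, List<String>> blockAdj,
                                      Map<String, Integer> summaries) {
        Map<String, Integer> out = new HashMap<>(summaries);
        for (var e : summaries.entrySet()) {
            for (String neighbor : blockAdj.getOrDefault(e.getKey(), List.of())) {
                // Fold the incoming summary into the neighbor block:
                // keep the smaller value, plus a crossing cost of 1.
                out.merge(neighbor, e.getValue() + 1, Math::min);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> blockAdj = Map.of(
            "block0", List.of("block1"),
            "block1", List.of("block0", "block2"));
        Map<String, Integer> summaries = new HashMap<>(Map.of(
            "block0", 0, "block1", 1_000_000, "block2", 1_000_000));
        Map<String, Integer> next = round(blockAdj, summaries);
        while (!next.equals(summaries)) { summaries = next; next = round(blockAdj, summaries); }
        System.out.println(summaries.get("block2"));  // prints 2
    }
}
```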

When we hooked up a Very Big Graph to our Hadoop cluster, we found that 
there were a lot of scaling problems, which went away when we started 
optimizing for streaming performance.


Bhupesh Bansal wrote:
> Can you elaborate here?
> Let's say I want to implement a DFS on my graph. I can't picture
> implementing it with the graph in pieces without putting a depth bound of
> (3-4) on it. Let's say we have 200M (4GB) edges to start with.
> Best,
> Bhupesh
> On 10/16/08 3:01 PM, "Owen O'Malley" <omalley@apache.org> wrote:
>> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
>>> We at Linkedin are trying to run some Large Graph Analysis problems on
>>> Hadoop. The fastest way to run would be to keep a copy of the whole
>>> graph in RAM at all mappers. (Graph size is about 8G in RAM.) We have a
>>> cluster of 8-core machines with 8G on each.
>> The best way to deal with it is *not* to load the entire graph in one
>> process. In the WebMap at Yahoo, we have a graph of the web that has
>> roughly 1 trillion links and 100 billion nodes. See
>> http://tinyurl.com/4fgok6. To invert the links, you process the graph
>> in pieces and re-sort
>> based on the target. You'll get much better performance and scale to
>> almost any size.
>>> What's the best way of doing that? Is there a way for multiple
>>> mappers on the same machine to access a RAM cache? I read about the
>>> Hadoop distributed cache; it looks like it copies the file (HDFS /
>>> HTTP) locally on the slaves, but not necessarily into RAM.
>> You could mmap the file from the distributed cache using MappedByteBuffer.
>> Then there will be one copy shared between JVMs...
>> -- Owen
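Owen's link-inversion step, re-keying each (source, target) edge by its target so that grouping by the new key collects all in-links, can be sketched in plain Java; the in-memory grouping here stands in for what the Hadoop shuffle/sort does across pieces of the graph:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of link inversion: stream (source, target) edges, re-key each
// one by target, and group. In Hadoop the map phase emits (target,
// source) and the shuffle/sort does the grouping across graph pieces.
public class InvertLinks {

    static Map<String, List<String>> invert(List<String[]> edges) {
        return edges.stream().collect(Collectors.groupingBy(
            e -> e[1],                    // new key: the link target
            TreeMap::new,                 // sorted by key, like the shuffle
            Collectors.mapping(e -> e[0], Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String[]> edges = List.of(
            new String[]{"a", "b"},
            new String[]{"c", "b"},
            new String[]{"a", "c"});
        System.out.println(invert(edges));  // prints {b=[a, c], c=[a]}
    }
}
```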
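And a minimal sketch of the mmap suggestion, assuming the distributed cache has already materialized the file on local disk; the temp file in main is a stand-in for a path you would obtain via DistributedCache.getLocalCacheFiles(conf) in a real task:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Memory-map a local file read-only, so the OS page cache holds one
// copy of the data shared by every JVM on the machine that maps it.
public class MmapGraph {

    static MappedByteBuffer map(Path localCacheFile) throws IOException {
        try (FileChannel ch = FileChannel.open(localCacheFile,
                                               StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a DistributedCache local file.
        Path p = Files.createTempFile("graph", ".bin");
        Files.write(p, new byte[]{1, 2, 3, 4});
        MappedByteBuffer buf = map(p);
        System.out.println(buf.get(2));  // prints 3
    }
}
```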
