hadoop-common-user mailing list archives

From Owen O'Malley <omal...@apache.org>
Subject Re: Distributed cache Design
Date Thu, 16 Oct 2008 22:01:30 GMT

On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:

> We at LinkedIn are trying to run some large graph analysis problems on
> Hadoop. The fastest way to run would be to keep a copy of the whole graph
> in RAM at all mappers. (The graph is about 8G in RAM.) We have a cluster
> of 8-core machines with 8G on each.

The best way to deal with it is *not* to load the entire graph in one
process. In the WebMap at Yahoo, we have a graph of the web that has
roughly 1 trillion links and 100 billion nodes. See http://tinyurl.com/4fgok6.
To invert the links, you process the graph in pieces and re-sort based on
the target. You'll get much better performance and scale to almost any size.

> What is the best way of doing that? Is there a way so that multiple
> mappers on the same machine can access a RAM cache? I read about Hadoop's
> distributed cache; it looks like it copies the file (HDFS / HTTP) locally
> onto the slaves, but not necessarily into RAM?

You could mmap the file from the distributed cache using a MappedByteBuffer.
Then there will be only one copy shared between JVMs...
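The mmap suggestion works because a read-only mapping is backed by the OS page cache, so several JVMs mapping the same local file share one physical copy of its pages. A small self-contained sketch (the temp file here stands in for the local file the distributed cache would have copied; names are illustrative):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapCacheDemo {
    // Map a file read-only into memory. The mapping stays valid after the
    // channel is closed, and its pages live in the OS page cache, shared
    // across every process that maps the same file.
    static MappedByteBuffer mapReadOnly(Path path) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the file DistributedCache would place on local disk;
        // here we just write two longs to demonstrate.
        Path p = Files.createTempFile("graph", ".bin");
        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "rw")) {
            raf.writeLong(42L);
            raf.writeLong(7L);
        }
        MappedByteBuffer buf = mapReadOnly(p);
        System.out.println(buf.getLong(0) + " " + buf.getLong(8)); // 42 7
        Files.delete(p);
    }
}
```

Each mapper JVM would run the `mapReadOnly` step against the same cache-local path, paying the RAM cost once per machine rather than once per task.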

-- Owen
