hadoop-common-user mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Distributed cache Design
Date Mon, 20 Oct 2008 10:26:41 GMT
I was very surprised by this as well.  I was doing variants on all-pairs
shortest paths and found that the best representation really was triples
containing from-node, to-node, and distance.  The nice side of this is that
you get scaling like you wouldn't believe (subject to big-omega, of course).
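As a rough illustration of that triple representation (the record formats, class
names, and "D node dist" / "E from to weight" line layout below are mine, not
Ted's code), here is what one relaxation round could look like with the old
mapred API: the mapper keys both distances and edges by the source node, and
the reducer takes the min of any candidate distances for that node and relaxes
its outgoing edges, emitting new candidates keyed by each target.  Iterating
the job until no distance improves gives a Bellman-Ford-style shortest path.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class RelaxStep {

  private static final Text EMPTY = new Text();

  // Routes both record kinds to the node they hang off of:
  //   "D u d"    -> key u, value "D d"      (current distance to u)
  //   "E u v w"  -> key u, value "E v w"    (edge triple from u)
  public static class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] f = line.toString().split("\\s+");
      if (f.length < 3) {
        return;
      }
      if (f[0].equals("D")) {
        out.collect(new Text(f[1]), new Text("D " + f[2]));
      } else {
        out.collect(new Text(f[1]), new Text("E " + f[2] + " " + f[3]));
      }
    }
  }

  // For each node: take the min of the candidate distances, re-emit the node's
  // distance and its edges for the next round, and emit a candidate distance
  // for every edge target.  Output lines use the same "D ..." / "E ..." format
  // so the job's output can feed the next iteration directly.
  public static class RelaxReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text node, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter r) throws IOException {
      long dist = Long.MAX_VALUE;
      List<String[]> edges = new ArrayList<String[]>();
      while (values.hasNext()) {
        String[] f = values.next().toString().split(" ");
        if (f[0].equals("D")) {
          dist = Math.min(dist, Long.parseLong(f[1]));
        } else {
          edges.add(f);                         // {"E", to, weight}
        }
      }
      if (dist != Long.MAX_VALUE) {
        out.collect(new Text("D " + node + " " + dist), EMPTY);
      }
      for (String[] e : edges) {
        out.collect(new Text("E " + node + " " + e[1] + " " + e[2]), EMPTY);
        if (dist != Long.MAX_VALUE) {
          long candidate = dist + Long.parseLong(e[2]);
          out.collect(new Text("D " + e[1] + " " + candidate), EMPTY);
        }
      }
    }
  }
}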

On Thu, Oct 16, 2008 at 4:05 PM, Colin Evans <colin@metaweb.com> wrote:

> The trick is to amortize your computation over the whole set.  So a DFS for a
> single node will always be faster on an in-memory graph, but Hadoop is a
> good tool for computing all-pairs shortest paths in one shot if you re-frame
> it as a belief-propagation and message-passing algorithm.
>
> A lot of the time, the computation still explodes into n^2 or worse, so you
> need to use a binning or blocking algorithm, like the one described here:
> http://www.youtube.com/watch?v=1ZDybXl212Q
>
> In the case of graphs, a blocking function would be to find overlapping
> strongly connected subgraphs where each subgraph fits in a reasonable amount
> of memory.  Then within each block, you do your computation and you pass a
> summary of that computation to adjacent blocks, which gets factored into the
> next computation.
>
> When we hooked up a Very Big Graph to our Hadoop cluster, we found that
> there were a lot of scaling problems, which went away when we started
> optimizing for streaming performance.
>
> -Colin
>
>
>
>
> Bhupesh Bansal wrote:
>
>> Can you elaborate here?
>>
>> Let's say I want to implement a DFS in my graph. I am not able to picture
>> implementing it by doing the graph in pieces without putting a depth bound
>> of (3-4) on it. Let's say we have 200M (4GB) edges to start with.
>>
>> Best
>> Bhupesh
>>
>>
>>
>> On 10/16/08 3:01 PM, "Owen O'Malley" <omalley@apache.org> wrote:
>>
>>
>>
>>> On Oct 16, 2008, at 1:52 PM, Bhupesh Bansal wrote:
>>>
>>>
>>>
>>>> We at LinkedIn are trying to run some large graph analysis problems on
>>>> Hadoop. The fastest way to run would be to keep a copy of the whole
>>>> graph in RAM at all mappers. (The graph is about 8G in RAM.) We have a
>>>> cluster of 8-core machines with 8G on each.
>>>>
>>>>
>>> The best way to deal with it is *not* to load the entire graph in one
>>> process. In the WebMap at Yahoo, we have a graph of the web that has
>>> roughly 1 trillion links and 100 billion nodes. See
>>> http://tinyurl.com/4fgok6. To invert the links, you process the graph in
>>> pieces and re-sort based on the target. You'll get much better performance
>>> and scale to almost any size.
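The inversion Owen describes maps naturally onto the shuffle: emit each link
keyed by its target and the framework does the re-sort.  A small sketch under
assumed names and a simple "source<TAB>target" input layout (this is not the
actual WebMap code):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class InvertLinks {

  // Input: one outgoing link per line, "source<TAB>target".
  // Emitting (target, source) lets the shuffle group links by target.
  public static class SwapMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter r) throws IOException {
      String[] f = line.toString().split("\t");
      if (f.length == 2) {
        out.collect(new Text(f[1]), new Text(f[0]));
      }
    }
  }

  // Each reduce call sees one target and every page that links to it.
  // Concatenating the sources is fine for a sketch; a real job would stream
  // the inlink list out rather than buffer it.
  public static class GatherReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text target, Iterator<Text> sources,
                       OutputCollector<Text, Text> out, Reporter r) throws IOException {
      StringBuilder inlinks = new StringBuilder();
      while (sources.hasNext()) {
        if (inlinks.length() > 0) {
          inlinks.append(' ');
        }
        inlinks.append(sources.next().toString());
      }
      out.collect(target, new Text(inlinks.toString()));
    }
  }
}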
>>>
>>>
>>>
>>>> What is the best way of doing that? Is there a way so that multiple
>>>> mappers on the same machine can access a RAM cache? I read about the
>>>> Hadoop distributed cache; it looks like it copies the file (hdfs / http)
>>>> locally onto the slaves, but not necessarily into RAM?
>>>>
>>>>
>>> You could mmap the file from the distributed cache using MappedByteBuffer.
>>> Then there will be only one copy shared between JVMs...
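A sketch of that mmap idea (the class name, file layout, and fixed 12-byte
record format are assumptions for illustration): the local file dropped by the
distributed cache is mapped read-only, and because the mapping is backed by the
OS page cache, every task JVM on the machine shares one physical copy of the
pages.  A single MappedByteBuffer is limited to about 2GB, so an 8G graph would
need several mappings or chunked files.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedEdges {

  // Map a local file (e.g. one delivered by the distributed cache) read-only.
  // The mapping stays valid after the channel is closed.
  public static MappedByteBuffer map(File localCacheFile) throws IOException {
    RandomAccessFile raf = new RandomAccessFile(localCacheFile, "r");
    try {
      FileChannel channel = raf.getChannel();
      // A single mapping tops out at Integer.MAX_VALUE bytes (~2GB); a larger
      // graph has to be split across several mappings or several files.
      return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    } finally {
      raf.close();
    }
  }

  // Example access, assuming fixed 12-byte records of (from, to, dist) ints.
  public static int[] edgeAt(MappedByteBuffer edges, int i) {
    int base = i * 12;
    return new int[] { edges.getInt(base), edges.getInt(base + 4), edges.getInt(base + 8) };
  }
}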
>>>
>>> -- Owen
>>>
>>>
>>
>>
>>
>
>


-- 
ted
