incubator-giraph-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Avery Ching <ach...@apache.org>
Subject Re: giraph stability problem
Date Mon, 23 Jan 2012 16:24:17 GMT
There is certainly a class of graph problems that just "blow up", by 
increasing edges until the graph is fully connected or at least have 
information that requires n^2 memory.  Unfortunately, this does require 
a lot of space (memory and disk).  Not much you can do about it except 
change your algorithm, or work on a smaller problem size.  For instance, 
the shortest paths algorithm (SimpleShortestPathsVertex) could be 
changed to find the actual shortest path to every node (but this would 
blow up memory/space as well).  Disk can help though, for some problems, 
since we typically have 1-2 more orders of magnitude of disk than memory.

Avery

On 1/23/12 7:16 AM, Claudio Martella wrote:
> There's not much you can do If you're trying an algorithm with
> exponential space complexity and you don't have enough disk space.
> What do you suggest?
>
> On Mon, Jan 23, 2012 at 4:13 PM, Deepak Nettem<deepaknettem@gmail.com>  wrote:
>> I have seen people run into this problem when doing Graph Processing
>> directly on top of Hadoop too. This kind of an approach would take
>> exponential space.
>>
>> While the proposed solution would prevent the OOM problem, eventually one
>> would run out of disk space.
>>
>>
>> On Mon, Jan 23, 2012 at 6:16 AM, Claudio Martella
>> <claudio.martella@gmail.com>  wrote:
>>> Hi,
>>>
>>> I've been struggling with a similar problem and that's why i've
>>> started working on the out-of-core message management, when the memory
>>> shrinks. Your particular problem can get to an upper bound of
>>> exponential space complexity, which you experience with the OOM. the
>>> possible paths you can extract for each source vertex are about
>>> O(d^l), where d is the average degree of the graph and l is the max
>>> length of the extracted path (if you do shortest paths, then it's the
>>> diameter).
>>>
>>> Giraph is all good and fast but it's all in-memory and for this reason
>>> it currently lacks a solution to your problem. I suggest you wait
>>> until GIRAPH-45 is ready (I should write an email tonight about that
>>> patch).
>>>
>>> Hope it makes sense to you,
>>> Claudio
>>>
>>> On Mon, Jan 23, 2012 at 10:56 AM, André Kelpe
>>> <efeshundertelf@googlemail.com>  wrote:
>>>> Hi list,
>>>>
>>>> I have been investigating giraph for a week now and I have a huge
>>>> stability
>>>> problem. I am running the trunk against a CDH3u2 hadoop cluster. The
>>>> problem
>>>> I am trying to solve goes as follows:
>>>>
>>>> In a graph there are 3 kinds of vertices:
>>>>
>>>> 1: end of the world vertices, which have only 1 connected edge
>>>> 2: bivalent vertices, which have exactly 2 connected edges
>>>> 3: multivalent vertices that have n-edges connected (n>2)
>>>>
>>>> The goal is now to calculate for each vertex that is not in category 2
>>>> all
>>>> the paths to the other reachable non bivalent vertices. One could say
>>>> all
>>>> pathes between all non-bivalent vertices.
>>>>
>>>> To give an example:
>>>>
>>>>                                         [9]
>>>>                                          |
>>>>                                          |
>>>>                                         <12>
>>>>                                          |
>>>>                                          |
>>>>       [5]-----<13>-----[6]-----<11>-----[7]-----<10>-----[8]
>>>>
>>>> In this path I want to know that [5] forms a path to [7] via the edges
>>>> <13>
>>>> and<11>. [7] forms a path via<12>  with [9], via<10>  with
[8] and via
>>>> <11><13>  with [5]. You get the idea... Directionality is not
important.
>>>>
>>>> The algorithm I have is pretty straight forward, in superstep 0 all
>>>> vertices
>>>> that are non-bivalent send a message to their neighbours via which edge
>>>> they
>>>> are reachable. In all following supersteps the bivalent vertices are
>>>> simply
>>>> forwarding this information and the non-bivalent ones are terminating
>>>> the
>>>> algorithm. The messages that they sent are made using Textwritable
>>>> instances
>>>> encoding the path.
>>>>
>>>> If I run this algorithm on input with 1 million edges it never finishes,
>>>> the
>>>> master process and then the others always go out of memory, even with a
>>>> 10GB
>>>> heap. I know that java programs can be memory hungry, but 10GB heap for
>>>> processing a 40MB input file is a bit to much in book. I have tried all
>>>> sorts
>>>> of settings, like disabling checkpoints, but nothing makes it finish. I
>>>> also
>>>> see a slowdown in processing, the first 20ish supersteps are done in no
>>>> time,
>>>> but then the processing slows down until it crashes in superstep 47.
>>>>
>>>> My questions are: What am I doing wrong? Do you guys have any pointers
>>>> for
>>>> things to look after? How can I get this thing to finish? Why is it so
>>>> memory
>>>> hungry?
>>>>
>>>> Thanks a lot for your help!
>>>>
>>>> -André
>>>
>>>
>>> --
>>>     Claudio Martella
>>>     claudio.martella@gmail.com
>
>


Mime
View raw message