Hi list,
I have been investigating giraph for a week now and I have a huge stability
problem. I am running the trunk against a CDH3u2 hadoop cluster. The problem
I am trying to solve goes as follows:
In a graph there are 3 kinds of vertices:
1: end of the world vertices, which have only 1 connected edge
2: bivalent vertices, which have exactly 2 connected edges
3: multivalent vertices that have nedges connected (n>2)
The goal is now to calculate for each vertex that is not in category 2 all
the paths to the other reachable non bivalent vertices. One could say all
pathes between all nonbivalent vertices.
To give an example:
[9]


<12>


[5]<13>[6]<11>[7]<10>[8]
In this path I want to know that [5] forms a path to [7] via the edges <13>
and <11>. [7] forms a path via <12> with [9], via <10> with [8] and via
<11><13> with [5]. You get the idea... Directionality is not important.
The algorithm I have is pretty straight forward, in superstep 0 all vertices
that are nonbivalent send a message to their neighbours via which edge they
are reachable. In all following supersteps the bivalent vertices are simply
forwarding this information and the nonbivalent ones are terminating the
algorithm. The messages that they sent are made using Textwritable instances
encoding the path.
If I run this algorithm on input with 1 million edges it never finishes, the
master process and then the others always go out of memory, even with a 10GB
heap. I know that java programs can be memory hungry, but 10GB heap for
processing a 40MB input file is a bit to much in book. I have tried all sorts
of settings, like disabling checkpoints, but nothing makes it finish. I also
see a slowdown in processing, the first 20ish supersteps are done in no time,
but then the processing slows down until it crashes in superstep 47.
My questions are: What am I doing wrong? Do you guys have any pointers for
things to look after? How can I get this thing to finish? Why is it so memory
hungry?
Thanks a lot for your help!
AndrĂ©
