giraph-user mailing list archives

From Baldo Faieta <bfai...@adobe.com>
Subject Adsorption on giraph - memory problems
Date Wed, 03 Oct 2012 01:01:12 GMT
Hi Everyone.

I have implemented the Adsorption algorithm
( http://rio.ecs.umass.edu/~lgao/ece697_10/Paper/random.pdf ,
http://talukdar.net/papers/adsorption_ecml09.pdf )
as it seems well suited for running in Giraph. I'm testing it with the
MovieLens dataset ( http://www.grouplens.org/node/73 ), and when I run it
on a small graph (6k nodes, 200k edges) it runs fine.

But as soon as I try to scale up the graph I run into memory problems. I'm
running it with 3 processes and I have set the mapred.map.child.java.opts
variable pretty high (2G per process). Looking at the memory allocation
in each superstep, it seems that all the messages are held in memory
during a superstep before being processed, so it runs out of memory
pretty quickly when I increase the size of the graph (e.g., 20k nodes,
1M edges).

The algorithm works by having each vertex send its label distribution along
its outgoing edges and aggregate the distributions it receives as messages.
I have implemented a combiner for the messages but it doesn't seem to help.
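
A minimal sketch of the aggregation step described above, assuming a label
distribution is represented as a map from label to weight. The class and
method names are hypothetical and do not use Giraph's combiner API; this only
illustrates the merge a combiner would perform on two messages bound for the
same vertex.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, independent of Giraph: merges label distributions. */
final class LabelDistributions {
    private LabelDistributions() {}

    /** Returns a new map holding the label-by-label sum of both inputs. */
    static Map<String, Double> combine(Map<String, Double> accumulated,
                                       Map<String, Double> incoming) {
        Map<String, Double> merged = new HashMap<>(accumulated);
        for (Map.Entry<String, Double> e : incoming.entrySet()) {
            // Add the incoming weight to any weight already stored for this label.
            merged.merge(e.getKey(), e.getValue(), Double::sum);
        }
        return merged;
    }

    /** Returns a copy scaled so the weights sum to 1 (unchanged if the sum is 0). */
    static Map<String, Double> normalize(Map<String, Double> dist) {
        double total = 0.0;
        for (double w : dist.values()) {
            total += w;
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Double> e : dist.entrySet()) {
            result.put(e.getKey(), total == 0.0 ? e.getValue() : e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("comedy", 0.5);
        a.put("drama", 0.5);
        Map<String, Double> b = new HashMap<>();
        b.put("drama", 0.25);
        b.put("action", 0.75);
        System.out.println(normalize(combine(a, b)));
    }
}
```

Note that the combined message is only as small as the union of the labels
involved, so merging distributions with few labels in common saves less memory
than merging single numbers does.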

I think the problem is that the messages themselves, being distributions,
consume more memory than those in other examples (e.g., PageRank), and it
seems that you need a hefty memory allocation per process to keep all
the messages in memory before they can be processed or even combined. Is
this the case? Is there a way to be more aggressive with the combiner?
Ideally it would be great to store the messages on disk until they can be
processed so as not to run into this problem. Does anyone have any
suggestions, or do I just have to get servers with much more memory?
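
To make the size difference concrete, here is a rough back-of-the-envelope
sketch. It assumes one uncombined message per edge per superstep, and the
number of labels per distribution (100) and the bytes per entry (16) are
illustrative assumptions, not measurements from this job. JVM object overhead
is ignored, so real usage would be higher.

```java
/** Hypothetical estimate of raw message payload per superstep. */
final class MessageMemoryEstimate {
    private MessageMemoryEstimate() {}

    /** edges * entriesPerMessage * bytesPerEntry, failing loudly on overflow. */
    static long messageBytes(long edges, int entriesPerMessage, int bytesPerEntry) {
        return Math.multiplyExact(edges, (long) entriesPerMessage * bytesPerEntry);
    }

    public static void main(String[] args) {
        long edges = 1_000_000L;        // the larger graph mentioned above
        int assumedLabels = 100;        // assumption: entries per distribution
        int assumedBytesPerEntry = 16;  // assumption: 8-byte label id + 8-byte weight

        // A PageRank-style message carries a single 8-byte double.
        long pageRankStyle = messageBytes(edges, 1, 8);
        long distributionStyle = messageBytes(edges, assumedLabels, assumedBytesPerEntry);

        System.out.println("single-double messages: " + pageRankStyle + " bytes");
        System.out.println("distribution messages:  " + distributionStyle + " bytes");
    }
}
```

Under these assumptions the distribution messages are 200 times larger than
single-double messages for the same graph, which is consistent with running
out of a 2G heap long before a PageRank job would.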

BTW, if anyone is interested, I can try to post the implementation. I am
using it to propagate resources to recommend to users, based on the
relations between users and resources and the relations among the
resources themselves (e.g., user --viewed--> movie,
director --directed--> movie, movie --is-genre-of--> genre, etc.)

Thanks,

Baldo

