giraph-user mailing list archives

From: Avery Ching
Subject: Re: Resources or advice on minimising memory usage in Giraph/Hadoop code ?
Date: Thu, 07 Jun 2012 03:33:32 GMT

No article or book, but here are a few tips.

1) Use combiners!  This can drastically reduce the amount of 
memory used by combining messages on the server side.
2) Set a smaller JVM thread stack size with "-Xss128k" or some other 
value (this should affect the RPC threads or Netty threads).
3) You'll want to minimize the state of every vertex as much as 
possible, perhaps by creating a custom vertex.
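
To make tip 1 a bit more concrete, here is a minimal stand-alone sketch of the idea of combining messages before they are delivered. It deliberately does not use Giraph's API: the class MessageCombinerSketch, its method names, and the choice of "keep the minimum" as the combine function are all assumptions for illustration. The point is only that the buffer holds one value per target vertex instead of one entry per message sent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; this is NOT Giraph's combiner API.
final class MessageCombinerSketch {
    // One combined value per destination vertex id, instead of a list of messages.
    private final Map<Long, Double> pending = new HashMap<>();

    // Fold a new message into whatever is already buffered for the target.
    // Assumption: the algorithm only needs the minimum of all incoming messages.
    void send(long targetVertexId, double message) {
        pending.merge(targetVertexId, message, Math::min);
    }

    // Number of buffered entries, which is bounded by the number of target vertices.
    int bufferedMessageCount() {
        return pending.size();
    }

    // Combined message for a vertex, or null if nothing was sent to it.
    Double messageFor(long targetVertexId) {
        return pending.get(targetVertexId);
    }

    public static void main(String[] args) {
        MessageCombinerSketch buffer = new MessageCombinerSketch();
        buffer.send(7L, 4.0);
        buffer.send(7L, 2.5);
        buffer.send(7L, 9.0);
        buffer.send(8L, 1.0);
        System.out.println(buffer.bufferedMessageCount()
                + " buffered entries, combined value for vertex 7 = "
                + buffer.messageFor(7L));
    }
}
```

This only helps when the messages to a vertex can be reduced with a commutative, associative operation (min, max, sum and so on). If every individual message matters, combining is not an option.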

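For tip 3, one possible way to shrink per-vertex state when ids and types are URIs is to map each distinct URI string to a small integer once and keep only the integers in vertices and messages. The sketch below is an assumption about how that could look, not something Giraph or Hadoop provides: UriDictionarySketch and its methods are hypothetical names, and in a real job the mapping would presumably be built in a preprocessing step so that all workers agree on the ids.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Giraph or Hadoop.
final class UriDictionarySketch {
    private final Map<String, Integer> idsByUri = new HashMap<>();
    private final List<String> urisById = new ArrayList<>();

    // Return the existing id for a URI, or assign the next free id.
    int encode(String uri) {
        Integer existing = idsByUri.get(uri);
        if (existing != null) {
            return existing;
        }
        int id = urisById.size();
        idsByUri.put(uri, id);
        urisById.add(uri);
        return id;
    }

    // Translate an id back to its URI, e.g. when writing final output.
    String decode(int id) {
        return urisById.get(id);
    }

    // Number of distinct URIs seen so far.
    int size() {
        return urisById.size();
    }

    public static void main(String[] args) {
        UriDictionarySketch dict = new UriDictionarySketch();
        int berlin = dict.encode("http://dbpedia.org/resource/Berlin");
        int city = dict.encode("http://dbpedia.org/ontology/City");
        int berlinAgain = dict.encode("http://dbpedia.org/resource/Berlin");
        System.out.println(berlin + " " + city + " " + berlinAgain
                + " -> " + dict.decode(berlin));
    }
}
```

A vertex would then hold an int id plus an int array of type ids rather than several String or Text objects, at the cost of the extra translation layer mentioned in the quoted message below.
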

On 6/5/12 7:38 PM, Benjamin Heitmann wrote:
> Hello,
> can somebody recommend a web page, article or book on minimising the memory usage of
> Giraph/Hadoop code ?
> I am looking for non-obvious advice on what *not* to do, and for best practices on what
> to do inside of Hadoop...
> E.g. is it preferable to use Java Strings or Hadoop Text Writables ? Should all strings
> be externalised ?
> Currently, I am running a Giraph job with 10 workers. Each worker has a maximum heap
> of Xmx7G.
> The concurrent garbage collection is enabled. The machine has 24 cores, and 96 GB of
> The job currently uses a max of around 50 GB, so there is free memory available outside
> of Java.
> The graph itself has ~2 million vertices and ~4 million edges, which is not really "big
> However, before starting superstep 1, I get heap space errors. Previous versions of my
> algorithm were simpler,
> but they also ran into heap space errors when the data was around one order of magnitude
> My suspicion is that the amount of state which my vertices have, and the number of messages
> which I am generating,
> exceed the standard use case of a PageRank algorithm by far.
> To list a few of the reasons why I need a lot of state:
> * I need to execute multiple runs of the same algorithm in parallel. Loading this specific
> graph takes about 3 minutes,
> running the algorithm once takes about 10 seconds or so, but I have around 600 users
> in that graph. And this is just a small graph,
> the whole algorithm is intended to be run for thousands of users. (... "big data"...)
> * The identities of the edges and vertices are not based on numbers but on strings.
> All edges and all vertices have a URI associated with them.
> The graph represents RDF data from different sources, such as DBpedia.
> In addition, most of the vertices have one or multiple types associated with them, and
> each type is again represented by a URI.
> These types are essential to the logic of the algorithm.
> I guess it would be possible to externalise all of those strings, but it adds a layer
> of complexity which I had previously hoped to avoid.
> * As Giraph does not currently provide a central coordination point for the processing
> of the graph,
> I need to send a lot of messages between vertices in order to coordinate the algorithm.
> * Giraph does not allow multiple Java classes to be used for different vertices in the
> same graph.
> However, different vertices have different roles in my algorithm, and each role has a
> different set of states in which it can be,
> due to the missing global coordination point.
> * Taken together, the lack of a central coordination point and the inability to have
> different Java classes as part of the same graph
> make the whole algorithm more similar to a network protocol than to a graph algorithm.
> Thus I need a lot of messages
> and a lot of state.
> If anybody has some good suggestions on how I should proceed, I would be very interested
> in hearing them.
> If somebody wants to take a look at my code, then I can currently provide you with that
> code in a non-public way.
> sincerely, Benjamin Heitmann.
