giraph-user mailing list archives

From André Kelpe <efeshundert...@googlemail.com>
Subject Re: Resources or advice on minimising memory usage in Giraph/Hadoop code ?
Date Thu, 07 Jun 2012 14:21:50 GMT
Hi!

One interesting JVM option I learned about lately is
-XX:+UseCompressedStrings, which uses a byte[] for all strings that
consist entirely of ASCII characters. Given that you are working with
URIs, I assume that this is true for most of your strings, so I would
give it a shot.
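
To get a rough feel for why this helps with URIs, here is a minimal sketch in plain Java that compares the payload of a string held as 2-byte chars with the same string held as one byte per character. The class and method names are made up for illustration, and object headers and the JVM's actual internal layout are ignored, so treat the numbers as an estimate only. Also check that your JVM version accepts the flag at all; as far as I know, not every HotSpot release does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: estimates string payload sizes, not real heap usage.
public class AsciiFootprint {

    // True if every char of s is in the 7-bit ASCII range.
    public static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // Payload in bytes when each char takes 2 bytes (UTF-16 code units).
    public static int charPayloadBytes(String s) {
        return 2 * s.length();
    }

    // Payload in bytes when the string is encoded as UTF-8;
    // for pure ASCII input this is exactly one byte per character.
    public static int bytePayloadBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String uri = "http://dbpedia.org/resource/Berlin";
        System.out.println("ascii only: " + isAscii(uri));
        System.out.println("as chars:   " + charPayloadBytes(uri) + " bytes");
        System.out.println("as bytes:   " + bytePayloadBytes(uri) + " bytes");
    }
}
```

For a pure-ASCII URI the byte form is half the size of the char form, which is the saving the option is aiming at.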

For more info on JVM options, please take a look here:
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

HTH

-André

2012/6/6 Benjamin Heitmann <benjamin.heitmann@deri.org>:
> Hello,
>
> can somebody recommend a web page, article or book on minimising the
> memory usage of Giraph/Hadoop code?
> I am looking for non-obvious advice on what *not* to do, and for best
> practices on what to do inside of Hadoop...
>
> E.g. is it preferable to use Java Strings or Hadoop Text Writables? Should
> all strings be externalised?
>
> Currently, I am running a Giraph job with 10 workers. Each worker has a
> maximum heap of 7 GB (-Xmx7G).
> The concurrent garbage collection is enabled. The machine has 24 cores and
> 96 GB of memory.
> The job currently uses a max of around 50 GB, so there is free memory
> available outside of Java.
>
> The graph itself has ~2 million vertices and ~4 million edges, which is not
> really "big data".
>
> However, before starting superstep 1, I get heap space errors. Previous
> versions of my algorithm were simpler,
> but they also ran into heap space errors when the data was around one order
> of magnitude bigger.
>
> My suspicion is that the amount of state which my vertices have, and the
> number of messages which I am generating,
> exceed the standard use case of a PageRank algorithm by far.
>
> To list a few of the reasons why I need a lot of state:
>
> * I need to execute multiple runs of the same algorithm in parallel.
> Loading this specific graph takes about 3 minutes,
> running the algorithm once takes about 10 seconds or so, but I have around
> 600 users in that graph. And this is just a small graph,
> the whole algorithm is intended to be run for thousands of users. (... "big data"...)
>
> * The identities of the edges and vertices are not based on numbers but on strings.
> All edges and all vertices have a URI associated with them.
> The graph represents RDF data from different sources, such as DBpedia.
> In addition, most of the vertices have one or multiple types associated with them, and
> each type is again represented by a URI.
> These types are essential to the logic of the algorithm.
> I guess it would be possible to externalise all of those strings, but it
> adds a layer of complexity which I had previously hoped to avoid.
>
> * As Giraph does not currently provide a central coordination point for the
> processing of the graph,
> I need to send a lot of messages between vertices in order to coordinate
> the algorithm.
>
> * Giraph does not allow multiple Java classes to be used for different
> vertices in the same graph.
> However, different vertices have different roles in my algorithm, and each
> role has a different set of states in which it can be,
> due to the missing global coordination point.
>
> * Taken together, the lack of a central coordination point and the
> inability to have different Java classes as part of the same graph
> make the whole algorithm more similar to a network protocol than to a
> graph algorithm. Thus I need a lot of messages
> and a lot of state.
>
>
> If anybody has some good suggestions on how I should proceed, I would be
> very interested in hearing them.
>
> If somebody wants to take a look at my code, then I can currently provide
> you with that code in a non-public way.
>
> sincerely, Benjamin Heitmann.
>
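
On the point above about externalising the URI strings: a minimal sketch of what that could look like is a dictionary that hands out one int per distinct URI, so that vertices, edges and messages carry ints and the strings are stored only once. The class `UriDictionary` and its methods are hypothetical, not part of Giraph or Hadoop, and the sketch assumes the dictionary is built once in a preprocessing step so that all workers agree on the IDs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical dictionary encoding: one int ID per distinct URI string.
public class UriDictionary {

    private final Map<String, Integer> idByUri = new HashMap<>();
    private final List<String> uriById = new ArrayList<>();

    // Returns the existing ID for uri, or assigns the next free one.
    public int intern(String uri) {
        Integer id = idByUri.get(uri);
        if (id == null) {
            id = uriById.size();
            idByUri.put(uri, id);
            uriById.add(uri);
        }
        return id;
    }

    // Maps an ID back to its URI; throws IndexOutOfBoundsException for unknown IDs.
    public String lookup(int id) {
        return uriById.get(id);
    }

    // Number of distinct URIs seen so far.
    public int size() {
        return uriById.size();
    }

    public static void main(String[] args) {
        UriDictionary dict = new UriDictionary();
        int berlin = dict.intern("http://dbpedia.org/resource/Berlin");
        int city = dict.intern("http://dbpedia.org/ontology/City");
        int again = dict.intern("http://dbpedia.org/resource/Berlin");
        System.out.println(berlin + " " + city + " " + again);
        System.out.println(dict.lookup(city));
    }
}
```

The extra layer is real, as noted in the original message, but type URIs that repeat across many vertices then cost four bytes per reference instead of a full string each.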
