giraph-user mailing list archives

From Avery Ching <>
Subject Re: Suggestions on problem sizes for giraph performance benchmarking
Date Tue, 10 Jul 2012 06:56:35 GMT
You should try using appropriate memory settings (e.g. "-Xms30g -Xmx30g -Xss128k" for a 30 GB heap).
How large a heap you can set depends on how much memory you can get.
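
For example, assuming you launch jobs through GiraphRunner (which accepts
Hadoop's generic -D options), you can pass those flags to the worker JVMs
via mapred.child.java.opts. The jar name, example class, and trailing
arguments below are placeholders for your own setup:

  hadoop jar giraph-with-dependencies.jar org.apache.giraph.GiraphRunner \
      -Dmapred.child.java.opts="-Xms30g -Xmx30g -Xss128k" \
      org.apache.giraph.examples.SimplePageRankVertex ...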


On 7/9/12 5:57 AM, Amani Alonazi wrote:
> Actually, I had the same problem of running out of memory with Giraph 
> when trying to implement the strongly connected components algorithm. 
> My input graph has 1 million nodes and 7 million edges, and I'm using 
> a cluster of 21 computers.
> On Mon, Jul 9, 2012 at 3:44 PM, Benjamin Heitmann <> wrote:
>     Hello Stephen,
>     sorry for the very late reply.
>     On 28 Jun 2012, at 02:50, Fleischman, Stephen (ISS SCI - Plano TX)
>     wrote:
>>     Hello Avery and all:
>>     I have a cluster of 10  two-processor/48 GB RAM servers, upon
>>     which we are conducting Hadoop performance characterization
>>     tests.  I plan to use the Giraph PageRank and simple shortest
>>     path examples as part of this exercise and would appreciate
>>     guidance on problem sizes for both.  I'm looking at paring
>>     down an obfuscated Twitter dataset, and it would save a lot of
>>     time if someone knows roughly how time and memory scale with the
>>     number of nodes in a graph.
>     I can provide some suggestions for the kind of algorithm and data
>     which currently surpasses the scalability of Giraph.
>     While the limits of my knowledge of Giraph and Hadoop are probably
>     also partly to blame, please see the recent discussions on this
>     list, and on JIRA, for other indications that the scalability of
>     Giraph needs improvement:
>     * post  by Yuanyuan Tian in the thread "wierd communication
>     errors" on <>
>     * GIRAPH-234 about GC overhead
>     If you want to stretch the limits of Giraph, then you need to try
>     an algorithm which is conceptually different from PageRank, and
>     you need a big data set.
>     If you use an algorithm with complex application logic (maybe
>     even domain-specific logic) that needs to be embedded in the
>     algorithm, then the vertices need to hold a lot of state. Such
>     algorithms also tend to send around a lot of messages, and each
>     message might have a payload more complex than a single floating
>     point number. It also helps to have a graph format which requires
>     strings on the edges and vertices; the strings are needed for the
>     domain-specific business logic which the graph algorithm has to
>     follow.
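>     For illustration, a message type along these lines (just a sketch
>     using Hadoop's Writable interface; the class name and fields are
>     made up):
>
>       import java.io.DataInput;
>       import java.io.DataOutput;
>       import java.io.IOException;
>       import org.apache.hadoop.io.Writable;
>
>       // Hypothetical message: a string payload (e.g. an edge URI)
>       // plus an activation value, i.e. more than one float.
>       public class ActivationMessage implements Writable {
>         private String edgeUri = "";
>         private float activation;
>
>         public void write(DataOutput out) throws IOException {
>           out.writeUTF(edgeUri);
>           out.writeFloat(activation);
>         }
>
>         public void readFields(DataInput in) throws IOException {
>           edgeUri = in.readUTF();
>           activation = in.readFloat();
>         }
>       }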
>     Finally, imagine a data set with a long loading time, where one
>     run of the algorithm only provides results for one user. The
>     standard Hadoop paradigm is to throw the graph away again after
>     each job. So if you have hundreds or thousands of users, you need
>     a way to execute the algorithm for many users in parallel. Again,
>     this adds a lot of state, as each vertex needs to hold one state
>     object for every user whose run has visited that vertex.
>     In my specific case, I had the following data and algorithm:
>     Data:
>     * an RDF graph with 10 million vertices and 40 million edges.
>     I used my own import code to map the RDF graph to an undirected
>     graph with at most one edge between any two nodes (so it was not
>     a multi-graph); a minimal sketch follows after this list.
>     * each vertex and each edge uses a string as its identity, to
>     represent a URI in the RDF graph (required for the business logic
>     in the algorithm)
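>     A minimal sketch of that import idea (not my actual code; it only
>     shows the edge deduplication):
>
>       import java.util.HashMap;
>       import java.util.HashSet;
>       import java.util.Map;
>       import java.util.Set;
>
>       // Collapse RDF triples into an undirected graph with at most
>       // one edge between any two vertices (no multi-graph).
>       public class RdfGraphBuilder {
>         private final Map<String, Set<String>> adjacency =
>             new HashMap<String, Set<String>>();
>
>         public void addTriple(String subjectUri, String predicateUri,
>                               String objectUri) {
>           // predicateUri would become the string identity of the
>           // edge; this sketch only keeps the adjacency structure.
>           addEdge(subjectUri, objectUri);
>           addEdge(objectUri, subjectUri); // undirected: both ways
>         }
>
>         private void addEdge(String from, String to) {
>           Set<String> neighbours = adjacency.get(from);
>           if (neighbours == null) {
>             neighbours = new HashSet<String>();
>             adjacency.put(from, neighbours);
>           }
>           neighbours.add(to); // Set semantics drop duplicate edges
>         }
>       }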
>     Algorithm:
>     * spreading activation.
>     You can think of it as a depth-first search guided by
>     domain-specific logic. A short introduction here:
>     The Wikipedia article only mentions using spreading activation
>     on weighted graphs; however, I used it on graphs which have
>     additional types on the edges. The whole area of using the
>     semantics of the edges to guide the algorithm is an active
>     research topic, which is why I can't point you to a good article
>     on it. A rough sketch of one activation step follows after this
>     list.
>     * parallel execution:
>     I need to run the algorithm once for every user in the system;
>     however, loading the data set alone takes around 15 minutes.
>     So each vertex holds an array of states, one for each user whose
>     run of the algorithm has visited that vertex. I experimented with
>     user counts between 30 and 1000; anything more did not work for
>     concurrent execution of the algorithm.
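>     Schematically, the per-user state and one activation step look
>     roughly like this (framework-independent Java; the names, the
>     threshold/decay parameters, and the send() call are placeholders,
>     not Giraph API):
>
>       import java.util.HashMap;
>       import java.util.Map;
>
>       // Hypothetical vertex state: one activation value per user
>       // whose run of the algorithm has reached this vertex.
>       public class SpreadingActivationVertex {
>         private final Map<Integer, Float> activationByUser =
>             new HashMap<Integer, Float>();
>
>         // One spreading-activation step for one user at this vertex.
>         // outEdges maps a neighbour's URI to a weight derived from
>         // the domain-specific semantics of the connecting edge.
>         void step(int userId, float incoming,
>                   Map<String, Float> outEdges,
>                   float threshold, float decay) {
>           activationByUser.put(userId, incoming);
>           if (incoming < threshold) {
>             return; // below the firing threshold: stay quiet
>           }
>           for (Map.Entry<String, Float> e : outEdges.entrySet()) {
>             float spread = incoming * decay * e.getValue();
>             send(e.getKey(), userId, spread); // placeholder send
>           }
>         }
>
>         private void send(String targetUri, int userId, float value) {
>           // placeholder: a real implementation would enqueue a
>           // message to the target vertex here
>         }
>       }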
>     Infrastructure:
>     * a single server with 24 Intel Xeon 2.4 GHz CPUs and 96 GB of RAM
>     * Hadoop 1.0, pseudo-distributed setup
>     * between 10 and 20 Giraph workers
>     A few weeks ago I stopped work on my Giraph-based implementation,
>     as Giraph ran out of memory almost immediately after loading and
>     initialising the data. I verified that the Giraph workers
>     themselves were not running out of memory, so the problem was
>     probably in IPC and messaging. The general discussion on the
>     Giraph mailing list strongly indicates that I hit the current IPC
>     scalability limits.
>     Currently I am working on a non-Hadoop version of the algorithm
>     which is not as scalable, but which is fast for *one* user (less
>     than a second per user, but single-threaded).
>     In addition, this new version lets me integrate better with an
>     existing ecosystem of technologies (Semantic Web technologies)
>     from which Hadoop and Giraph are currently completely
>     disconnected. However, I will probably revisit Giraph at some
>     point in the future.
>     If you want to look at the code, the data, or any other asset I
>     have, I will gladly share it with you.
>     I would really like Giraph to reach the maturity required for
>     this kind of algorithm. However, I have the feeling that the
>     current development focus is on clear-cut numerical algorithms
>     such as PageRank.
> -- 
> Amani AlOnazi
> MSc Computer Science
> King Abdullah University of Science and Technology
> Kingdom of Saudi Arabia
> <> | +966 (0) 555 191 795
