giraph-user mailing list archives

From "Tu, Min"
Subject Re: General Scalability Questions for Giraph
Date Thu, 14 Feb 2013 23:17:38 GMT
Hi Claudio,

Thank you very much for your valuable input. I will follow your suggestions and try Giraph
0.2 (from trunk) along with the workers setting.


From: Claudio Martella
Date: Thursday, February 14, 2013 3:06 PM
Subject: Re: General Scalability Questions for Giraph

Hi Tu,

First of all, I really suggest you run trunk, especially if you have a large graph. That being
said:

1) Yes and no; the jargon is misleading. You should have at most n - 1 workers (what you call
the mappers of a Giraph job), where n is the max number of mappers you can have in your
cluster (the additional 1 goes to the master). In general, I'd strongly suggest you have 1
mapper/worker per node/MACHINE, and k compute threads per worker, with k as the number of
cores on that machine. That way you avoid Netty sending messages over the loopback interface,
as well as the additional JVM overhead.

2) Yes, but I challenge you to compute those sizes beforehand :) Also consider the size of
the messages being produced by your algorithm. E.g. roughly, PageRank produces a double for
each edge in the graph, during each superstep.

3) AFAIK there's no way, but I might be wrong here.

4) I'd suggest you also talk in terms of nodes. Having multiple workers per machine gives a
misleading picture of scalability in certain respects (such as network I/O). I have been
running Giraph jobs on hundreds of mappers and around 65 machines. I know others here have
run bigger jobs (~300 workers). I'd say the upper limit to scalability is your main memory
ATM, so you might want to have a look at out-of-core graph and messages.
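
The rules of thumb in points 1, 2 and 4 can be turned into a rough back-of-the-envelope
check. The following is only a sketch under stated assumptions: the function names, the
8-cores-per-machine figure, the edge count, the graph size and the heap size are all
hypothetical and chosen for illustration, and the message estimate counts raw 8-byte doubles
only, ignoring JVM object and serialization overhead. It does not call any Giraph API.

```python
BYTES_PER_DOUBLE = 8  # raw payload of one PageRank message (one double per edge)


def plan_workers(max_mappers, machines, cores_per_machine):
    """Return (workers, compute_threads_per_worker) following point 1."""
    if max_mappers < 2:
        raise ValueError("need at least 2 mapper slots: one master plus one worker")
    # One mapper slot goes to the master, so n - 1 is the upper limit for workers.
    upper_limit = max_mappers - 1
    # One worker per machine, capped by that upper limit.
    workers = min(machines, upper_limit)
    # k compute threads per worker, with k the number of cores on the machine.
    return workers, cores_per_machine


def pagerank_message_bytes(num_edges):
    """Rough message volume per superstep: one double per edge (point 2)."""
    return num_edges * BYTES_PER_DOUBLE


def fits_in_memory(graph_bytes, message_bytes, workers, heap_bytes_per_worker):
    """True if graph plus one superstep of messages fits in aggregate worker heap."""
    return graph_bytes + message_bytes <= workers * heap_bytes_per_worker


if __name__ == "__main__":
    GIB = 1024 ** 3
    # 200 slots and 65 machines appear in the thread; everything else is hypothetical.
    workers, threads = plan_workers(max_mappers=200, machines=65, cores_per_machine=8)
    messages = pagerank_message_bytes(num_edges=10_000_000_000)
    ok = fits_in_memory(graph_bytes=200 * GIB, message_bytes=messages,
                        workers=workers, heap_bytes_per_worker=8 * GIB)
    print(f"workers={workers}, threads per worker={threads}")
    print(f"messages per superstep: {messages / GIB:.1f} GiB, fits in memory: {ok}")
```

If the check comes out negative, that is the situation where out-of-core graph and messages
become worth looking at.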

Hope it helps,

On Thu, Feb 14, 2013 at 11:50 PM, Tu, Min wrote:

I have some general scalability questions for Giraph. Based on the Giraph design, I am assuming
that all the mappers in a Giraph job must be running at the same time.

If so, then

  1.  The max number of mappers for a Giraph job <= total mapper slots in the whole cluster
  2.  The max data input size to Giraph should be <= total mapper slots * mapper memory
  3.  If the cluster has 200 mapper slots in total, only 100 are currently available,
and the Giraph job requires 150 mappers:
     *   Without any configuration change, 100 of the Giraph mappers will be started, but
the Giraph job will NOT run successfully
     *   Is there any configuration in Giraph to start the job ONLY once all the required
mapper slots are available?
  4.  How well does Giraph scale? I can ONLY run up to 150 mappers for my Giraph job.
Has anyone run a large Giraph job on a large cluster successfully?
     *   I am using Giraph 0.1 in my cluster

Thanks a lot for your time and input.


   Claudio Martella
