giraph-user mailing list archives

From Claudio Martella <claudio.marte...@gmail.com>
Subject Re: General Scalability Questions for Giraph
Date Thu, 14 Feb 2013 23:06:16 GMT
Hi Tu,

First of all, I strongly suggest you run trunk, especially if you have a
large graph. That being said:

1) Yes and no; the jargon is misleading. The upper limit is n - 1 workers
(what you call mappers for a Giraph job), where n is the maximum number of
mappers your cluster can run at once (the remaining 1 goes to the master).
In general, I'd strongly suggest 1 mapper/worker per node/machine and k
compute threads per worker, with k being the number of cores on that
machine. That saves Netty from sending messages over the loopback
interface and avoids the overhead of additional JVMs.
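
As a rough illustration of that sizing rule, here is a minimal Python
sketch. The function names and the example numbers are made up for
illustration and are not part of Giraph; it also assumes the master shares
a machine with one of the workers, which the advice above does not specify.

```python
def max_workers(mapper_slots: int) -> int:
    """Upper limit on workers: one mapper slot is reserved for the master."""
    if mapper_slots < 2:
        raise ValueError("need at least 2 mapper slots (1 master + 1 worker)")
    return mapper_slots - 1


def suggested_layout(machines: int, cores_per_machine: int, mapper_slots: int) -> dict:
    """One worker per machine, one compute thread per core.

    Assumption (hypothetical): the master is colocated with a worker, so it
    does not take a machine away from the worker count.
    """
    workers = min(machines, max_workers(mapper_slots))
    return {
        "workers": workers,
        "compute_threads_per_worker": cores_per_machine,
    }


if __name__ == "__main__":
    # Example numbers only: 65 machines, 8 cores each, 200 mapper slots.
    print(suggested_layout(machines=65, cores_per_machine=8, mapper_slots=200))
```

The point of the layout is that worker count follows the number of
machines, not the number of slots, while the cores are used through
threads inside a single JVM.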

2) Yes, but I challenge you to compute those sizes beforehand :) Also
consider the size of the messages produced by your algorithm. E.g.,
roughly, PageRank produces one double per edge in the graph during each
superstep.
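
To put a number on the PageRank example, a back-of-the-envelope sketch in
Python follows. It counts only the raw 8-byte double per edge; JVM object
overhead, serialization framing and any combiner are ignored, so treat the
result as a lower bound. The function name and the edge count are
hypothetical.

```python
BYTES_PER_DOUBLE = 8  # size of an IEEE 754 double-precision value


def pagerank_message_bytes(num_edges: int) -> int:
    """Raw message payload per superstep: one double per edge.

    Assumption: no combiner and no per-message overhead are modelled.
    """
    if num_edges < 0:
        raise ValueError("num_edges must be non-negative")
    return num_edges * BYTES_PER_DOUBLE


if __name__ == "__main__":
    edges = 1_000_000_000  # example graph size, not from the thread
    payload = pagerank_message_bytes(edges)
    print(f"{payload} bytes (~{payload / 2**30:.1f} GiB) per superstep")
```

This payload comes on top of the memory needed for the graph itself, which
is why input size alone underestimates what the job needs.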

3) AFAIK there's no way, but I might be wrong here.

4) I'd suggest you also talk in terms of nodes. Running multiple workers
per machine gives a misleading picture of scalability in some respects
(such as network I/O). I have been running Giraph jobs on hundreds of
mappers across around 65 machines. I know others here have gone bigger
(~300 workers). I'd say the upper limit to scalability at the moment is
your main memory, so you might want to have a look at out-of-core graph
and messages.

Hope it helps,
Claudio


On Thu, Feb 14, 2013 at 11:50 PM, Tu, Min <mitu@paypal.com> wrote:

>  Hi,
>
>  I have some general scalability questions for Giraph. Based on the
> Giraph design, I am assuming all the mappers in giraph job should be
> running at the same time.
>
>  If so, then
>
>    1. The max mappers for giraph job <= total mapper slots in the whole
>    cluster
>    2. The max data input size to giraph should be <= total mapper slots *
>    mapper memory limit
>    3. If the total mapper slots in the cluster is 200 and only 100 mappers
>    are currently available, and the giraph job requires 150 mappers
>       1. Without any configuration change, the 100 mappers of the giraph
>       job will be started but the giraph job will NOT run successfully
>       2. Is there any configuration in Giraph to start the job ONLY at
>       the time when all the mapper slots are available?
>    4. How is the scalability in giraph? I can ONLY run up to 150 mappers
>    for my giraph job. Does anyone run a large giraph job in large cluster
>    successfully?
>       1. I am using giraph 0.1 in my cluster
>
>
>  Thanks a lot for your time and inputs.
>
>  Min
>



-- 
   Claudio Martella
   claudio.martella@gmail.com
