Mailing-List: contact user-help@giraph.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@giraph.apache.org
From: "Tu, Min" <mitu@paypal.com>
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: Re: General Scalability Questions for Giraph
Thread-Topic: General Scalability Questions for Giraph
Thread-Index: AQHOCwWzKuKXRuWlhkevKoSyhl2x5Jh6f9kA//99DYA=
Date: Thu, 14 Feb 2013 23:17:38 +0000
Message-ID: 
 <345801A3A7546D488A0CE4001E4FFB280838B112@RHV-EXRDA-S11.corp.ebay.com>
In-Reply-To: 
 <CAFJOoJffqKN9PW6wpNk7c7BsS4cjJQnPD2vL3KP063qQJtEp0w@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
Content-Type: multipart/alternative;
	boundary="_000_345801A3A7546D488A0CE4001E4FFB280838B112RHVEXRDAS11corp_"
MIME-Version: 1.0

--_000_345801A3A7546D488A0CE4001E4FFB280838B112RHVEXRDAS11corp_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hi Claudio,

Thank you very much for your valuable input. I will follow your suggestions
and try Giraph 0.2 (from trunk) with the worker settings you describe.

Min

From: Claudio Martella <claudio.martella@gmail.com>
Reply-To: "user@giraph.apache.org" <user@giraph.apache.org>
Date: Thursday, February 14, 2013 3:06 PM
To: "user@giraph.apache.org" <user@giraph.apache.org>
Subject: Re: General Scalability Questions for Giraph

Hi Tu,

First of all, I really suggest you run trunk, especially if you have a
large graph. That being said:

1) Yes and no; the jargon is misleading. You should have at most n - 1
workers (what you call mappers for the Giraph job), where n is the maximum
number of mappers your cluster can run; the remaining one goes to the
master. In general, I'd strongly suggest you run 1 mapper/worker per
node/MACHINE, with k compute threads per worker, where k is the number of
cores on that machine. That saves Netty from sending messages over the
loopback and avoids additional JVM overhead.

2) Yes, but I challenge you to compute those sizes beforehand :) Also
consider the size of the messages produced by your algorithm. E.g.,
roughly, PageRank produces a double for each edge in the graph during each
superstep.

3) AFAIK there's no way, but I might be wrong here.

4) I'd suggest you also talk in terms of nodes. Having multiple workers per
machine gives a misleading picture of scalability in certain respects (such
as network I/O). I have been running Giraph jobs on hundreds of mappers and
around 65 machines. I know others here have run bigger numbers (~300
workers). I'd say the upper limit to scalability is your main memory ATM,
so you might want to have a look at out-of-core graph and messages.
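
As a rough illustration of the arithmetic in points 1) and 2), here is a
minimal Python sketch. The function names and the example numbers (66
mapper slots, 16 cores, 1 billion edges) are assumptions made up for
illustration, not Giraph APIs or measurements. It only encodes the rules of
thumb above: at most n - 1 workers, k compute threads per worker, and
roughly one 8-byte double per edge per superstep for PageRank.

```python
DOUBLE_BYTES = 8  # size of a 64-bit double, the PageRank message payload


def max_workers(max_mappers):
    """Upper limit on workers: one mapper slot is reserved for the master."""
    if max_mappers < 2:
        raise ValueError("need at least 2 mapper slots (1 master + 1 worker)")
    return max_mappers - 1


def compute_threads(cores_per_machine):
    """With one worker per machine, use one compute thread per core."""
    return cores_per_machine


def pagerank_message_bytes(num_edges):
    """Rough payload per superstep: one double per edge, ignoring overhead."""
    return num_edges * DOUBLE_BYTES


# Hypothetical cluster: 66 mapper slots, 16 cores per machine, 1e9 edges.
print(max_workers(66))                         # 65
print(compute_threads(16))                     # 16
print(pagerank_message_bytes(10**9) / 2**30)   # about 7.45 (GiB per superstep)
```

This counts raw message payload only; the graph itself and any per-message
overhead come on top, which is why memory tends to be the limit.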

Hope it helps,
Claudio


On Thu, Feb 14, 2013 at 11:50 PM, Tu, Min <mitu@paypal.com> wrote:
Hi,

I have some general scalability questions for Giraph. Based on the Giraph
design, I am assuming all the mappers in a Giraph job should be running at
the same time.

If so, then

  1.  The max mappers for a Giraph job <= total mapper slots in the whole
      cluster
  2.  The max data input size to Giraph should be <= total mapper slots *
      mapper memory limit
  3.  If the cluster has 200 mapper slots in total, only 100 are currently
      available, and the Giraph job requires 150 mappers
     *   Without any configuration change, 100 mappers of the Giraph job
         will be started, but the job will NOT run successfully
     *   Is there any configuration in Giraph to start the job ONLY when
         all the mapper slots are available?
  4.  How is the scalability in Giraph? I can ONLY run up to 150 mappers
      for my Giraph job. Has anyone run a large Giraph job in a large
      cluster successfully?
     *   I am using Giraph 0.1 in my cluster
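
The scenario in question 3 can be made concrete with a small Python sketch.
It only illustrates the all-or-nothing assumption stated above (every
mapper of the job must be running at the same time); the function name and
its parameters are hypothetical and are not part of Giraph or Hadoop.

```python
def can_run_now(required_mappers, available_slots, total_slots):
    """Return True only if every mapper of the job can start at once."""
    if required_mappers > total_slots:
        return False  # the job can never fit in this cluster
    return required_mappers <= available_slots


# The scenario from question 3: 200 slots in total, 100 free, 150 required.
print(can_run_now(150, 100, 200))  # False: only 100 mappers could start
print(can_run_now(150, 200, 200))  # True: all 150 mappers fit at once
```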

Thanks a lot for your time and inputs.

Min


--
   Claudio Martella
   claudio.martella@gmail.com

--_000_345801A3A7546D488A0CE4001E4FFB280838B112RHVEXRDAS11corp_--