Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of jeff.kubina@gmail.com
 designates 209.85.216.171 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALeH91AD3eF5gC6Wu8RFwgfWF=CWZs5YDvAdGae-ceRfHyXhJA@mail.gmail.com>
References: 
 <CALeH91AD3eF5gC6Wu8RFwgfWF=CWZs5YDvAdGae-ceRfHyXhJA@mail.gmail.com>
From: Jeff Kubina <jeff.kubina@gmail.com>
Date: Sat, 19 Jan 2013 16:20:06 -0500
Message-ID: 
 <CA+Vtps7mMQ6aKYn=GQJ9pSXAXEsd=j8hX3iy6o143qtyobieiw@mail.gmail.com>
Subject: Re: Hadoop Scalability
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=047d7b6da460fbebee04d3aacbf2

--047d7b6da460fbebee04d3aacbf2
Content-Type: text/plain; charset=ISO-8859-1

Thiago, when addressing scaling you want to consider two things: whether
the algorithm scales and, if so, whether the system architecture enables
the algorithm to scale. That is, if the algorithm scales on paper, does it
also scale on the hardware?

Algorithms that communicate an amount of data bounded by a constant will
scale on just about any Hadoop cluster, up to a point. At about 4000 nodes
the NameNode server may start to become overwhelmed (a bottleneck) and
slow processing down considerably. I think this bottleneck will be
eliminated in a not-too-distant release of HDFS.

If the amount of data the algorithm communicates is proportional to the
number of processors (map or reduce tasks), then the network bandwidth of
the cluster must increase in proportion to the number of processors to
achieve scaling (since Hadoop is based on the bulk synchronous parallel
model <http://en.wikipedia.org/wiki/Bulk_synchronous_parallel>). In such
cases a low-bandwidth network will impede scaling. Bryan Duxbury has a nice
blog post about networking a Hadoop cluster here <http://goo.gl/uVeoM>.
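
To make the two cases concrete, here is a toy cost model in Python. It is
only a sketch under my own assumptions, not a measurement of any real
cluster: one bulk-synchronous step costs compute time (work / p) plus
transfer time (data / bandwidth), and the names and numbers (WORK, UNIT,
LINK, step_time, speedup) are made up for illustration. It ignores the
NameNode limit entirely.

```python
# Toy cost model for one bulk-synchronous step (illustrative only).
# Every name and number here is an assumption, not a Hadoop measurement.

def step_time(p, work, data, bandwidth):
    """Time for one step on p processors: compute time plus transfer time."""
    return work / p + data / bandwidth

def speedup(p, work, data_fn, bandwidth_fn):
    """Speedup on p processors relative to a single processor."""
    t1 = step_time(1, work, data_fn(1), bandwidth_fn(1))
    tp = step_time(p, work, data_fn(p), bandwidth_fn(p))
    return t1 / tp

WORK = 10_000.0  # total compute, in seconds on one processor
UNIT = 1.0       # data units: the constant bound, or the amount per processor
LINK = 10.0      # data units per second of cluster bandwidth

# Each case pairs "data communicated" with "cluster bandwidth",
# both as functions of the processor count p.
cases = {
    "constant data, fixed network": (lambda p: UNIT, lambda p: LINK),
    "data ~ p, fixed network": (lambda p: UNIT * p, lambda p: LINK),
    "data ~ p, network ~ p": (lambda p: UNIT * p, lambda p: LINK * p),
}

for name, (data_fn, bandwidth_fn) in cases.items():
    row = [round(speedup(p, WORK, data_fn, bandwidth_fn), 1)
           for p in (10, 100, 1000)]
    print(f"{name:30s} {row}")
```

With these made-up numbers, the first and third cases keep speeding up as
processors are added, while the second flattens out and then gets worse,
because transfer time grows with p while compute time shrinks.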

More concisely, I would say that "Hadoop scales on clusters with networks
that scale (up to ~4000 nodes)."
-- 
Jeff Kubina

On Thu, Jan 17, 2013 at 10:09 PM, Thiago Vieira <tpbvieira@gmail.com> wrote:

> Hello!
>
> It is common to see this sentence: "Hadoop Scales Linearly". But is there
> any performance evaluation to confirm this?
>
> In my evaluations, Hadoop processing capacity scales linearly, but not in
> proportion to the number of nodes: the processing capacity achieved with
> 20 nodes is not double the processing capacity achieved with 10 nodes.
> Is there any evaluation of this?
>
> Thank you!
>
> --
> Thiago Vieira
>

--047d7b6da460fbebee04d3aacbf2--