From: Jeff Kubina
Date: Sat, 19 Jan 2013 16:20:06 -0500
Subject: Re: Hadoop Scalability
To: user@hadoop.apache.org

Thiago, when addressing scaling, you want to consider whether the algorithm scales and, if so, whether the system architecture enables it to scale. That is, if the algorithm scales on paper, does it also scale on the hardware?

Algorithms that communicate an amount of data bounded by a constant will scale on just about any Hadoop cluster, up to a point. At about 4000 nodes the namenode server may start to become overwhelmed (a bottleneck) and slow processing down considerably. I think this bottleneck is eliminated in a not-too-distant release of HDFS.

If the amount of data the algorithm communicates is proportional to the number of processors (map or reduce tasks), then the network bandwidth of the cluster must increase in proportion to the number of processors (since Hadoop is based on the bulk synchronous parallel model) to achieve scaling. In such cases a low-bandwidth network will impede scaling. Bryan Duxbury has a nice blog post about networking a Hadoop cluster here.
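
A rough way to see the difference between the two cases is a toy bulk-synchronous cost model. The sketch below is only an illustration under stated assumptions, not a measurement of Hadoop: the function names, the cost formula (compute time split evenly across nodes, plus total bytes communicated divided by aggregate network bandwidth), and every parameter value are hypothetical.

```python
def superstep_time(work_s, nodes, total_bytes, aggregate_bw):
    """Toy BSP cost: compute is split evenly, then all data crosses the network."""
    return work_s / nodes + total_bytes / aggregate_bw


def speedup(nodes, bytes_for, bandwidth_for, work_s=3600.0):
    """Speedup over one node for the given communication and bandwidth rules."""
    t1 = superstep_time(work_s, 1, bytes_for(1), bandwidth_for(1))
    tp = superstep_time(work_s, nodes, bytes_for(nodes), bandwidth_for(nodes))
    return t1 / tp


# Hypothetical parameters: 1 GB of data and 100 MB/s of bandwidth as base units.
GB = 1e9


def constant_data(p):
    return 1 * GB        # communication bounded by a constant


def proportional_data(p):
    return p * GB        # communication grows with the node count


def fixed_network(p):
    return 100e6         # aggregate bandwidth does not grow with the cluster


def scaling_network(p):
    return p * 100e6     # aggregate bandwidth grows with the node count


for p in (10, 100, 1000):
    print(p,
          round(speedup(p, constant_data, fixed_network), 1),
          round(speedup(p, proportional_data, fixed_network), 1),
          round(speedup(p, proportional_data, scaling_network), 1))
```

With these made-up numbers, speedup keeps growing when the network scales with the cluster, but it falls as nodes are added once communication grows with the node count over a fixed-bandwidth network.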

More concisely, I would say that "Hadoop scales on clusters with networks that scale (up to ~4000 nodes)."
--
Jeff Kubina

On Thu, Jan 17, 2013 at 10:09 PM, Thiago Vieira <tpbvieira@gmail.com> wrote:
> Hello!
>
> It is common to see this sentence: "Hadoop Scales Linearly". But is there any performance evaluation to confirm this?
>
> In my evaluations, Hadoop processing capacity scales linearly, but not in proportion to the number of nodes: the processing capacity achieved with 20 nodes is not double the processing capacity achieved with 10 nodes. Is there any evaluation of this?
>
> Thank you!
>
> --
> Thiago Vieira
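
One way to quantify the observation above is to compute a scaling efficiency between two cluster sizes: the measured gain in throughput divided by the gain in node count. The sketch below is a minimal illustration; the helper name `scaling_efficiency` and the throughput figures are hypothetical and are not taken from Thiago's evaluation.

```python
def scaling_efficiency(nodes_a, throughput_a, nodes_b, throughput_b):
    """Return the measured throughput gain divided by the node-count gain.

    1.0 means capacity grew in proportion to the nodes added;
    values below 1.0 mean sub-proportional scaling.
    """
    if min(nodes_a, nodes_b) <= 0 or throughput_a <= 0:
        raise ValueError("node counts and baseline throughput must be positive")
    return (throughput_b / throughput_a) / (nodes_b / nodes_a)


# Hypothetical measurements (MB/s processed), for illustration only.
efficiency = scaling_efficiency(10, 400.0, 20, 680.0)
print(f"efficiency from 10 to 20 nodes: {efficiency:.2f}")
```

An efficiency that stays below 1.0 as nodes are added is consistent with capacity growing, but by less than the node count.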
