From: Ted Dunning
Date: Fri, 18 Jan 2013 19:39:53 -0800
Subject: Re: Estimating disk space requirements
To: user@hadoop.apache.org

Jeff makes some good points here.

On Fri, Jan 18, 2013 at 5:01 PM, Jeffrey Buell <jbuell@vmware.com> wrote:

> I disagree. There are some significant advantages to using "many small
> nodes" instead of "few big nodes". As Ted points out, there are some
> disadvantages as well, so you have to look at the trade-offs. But consider:
>
> - NUMA: If your Hadoop nodes span physical NUMA nodes, then performance
> will suffer from remote memory accesses. The Linux scheduler tries to
> minimize this, but I've found that about 1/3 of memory accesses are remote
> on a 2-socket machine. This effect will be more severe on bigger machines.
> Hadoop nodes that fit within a NUMA node will not access remote memory at
> all (at least on vSphere).

This is definitely a good point with respect to stock Hadoop, but with a
system like MapR there is a significant amount of core locality going on to
minimize NUMA-remote fetches. That can have a significant impact, of course.
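For a rough sense of what that remote-access fraction costs, here is a minimal
back-of-envelope sketch in Python. The local and remote latency numbers are
assumptions chosen only for illustration; the ~1/3 remote fraction is the
figure quoted above.

    # Rough effect of NUMA-remote memory accesses on average latency.
    # local_ns and remote_ns are assumed values for a 2-socket machine,
    # used only to illustrate the shape of the penalty.
    def effective_latency(remote_fraction, local_ns=80.0, remote_ns=130.0):
        return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

    spanning = effective_latency(1.0 / 3.0)  # node spans both sockets, ~1/3 remote
    pinned = effective_latency(0.0)          # node fits within one NUMA node

    print(f"spanning sockets: {spanning:.1f} ns average")
    print(f"pinned to one NUMA node: {pinned:.1f} ns average")
    print(f"average memory-latency penalty: {spanning / pinned - 1.0:.1%}")

With those assumed numbers the spanning case pays roughly a 20% average
memory-latency penalty, which is the gap that sizing a node to fit one NUMA
node avoids.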
> - Disk partitioning: Smaller nodes with fewer disks each can significantly
> increase average disk utilization, not decrease it. Having many threads
> operating against many disks in the "big node" case tends to leave some
> disks idle while others are over-subscribed.

Again, this is an implementation side effect. Good I/O scheduling and proper
striping can mitigate this substantially (see the toy utilization sketch at
the end of this message).

Going the other way, splitting disks between different VMs can be disastrous.

> Partitioning disks among nodes decreases this effect. The extreme case is
> one disk per node, where no disks will be idle as long as there is work to
> do.

Yes. Even deficient implementations should succeed in this case.

You do lose the ability to allow big-memory jobs that would otherwise span
multiple slots.

> - Management: Not a performance effect, but smaller nodes enable easier
> multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware
> with other workloads, etc.

Definitely true.
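To put a toy number on the disk-utilization point above, here is a minimal
Python sketch. The workload model, in which each concurrently busy task
touches one uniformly chosen disk per scheduling interval, is an assumption
made purely for illustration; real block placement and I/O scheduling behave
better, which is exactly the mitigation point made above.

    import random

    def idle_fraction(num_disks, busy_tasks, trials=10000):
        """Average fraction of disks that receive no work in one interval
        when each busy task touches a uniformly chosen disk."""
        idle = 0.0
        for _ in range(trials):
            hit = {random.randrange(num_disks) for _ in range(busy_tasks)}
            idle += (num_disks - len(hit)) / num_disks
        return idle / trials

    # "Big node": 12 concurrently busy tasks spread across 12 shared disks.
    print(f"12 tasks over 12 shared disks: {idle_fraction(12, 12):.0%} of disks idle")

    # Partitioned extreme: one disk per small node, so every node with work
    # keeps its single disk busy -- no idle disks as long as there is work.

Under that toy model roughly a third of the big node's disks sit idle in any
given interval, while the one-disk-per-node extreme keeps every disk busy
whenever its node has work to do.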