From: Ted Dunning
Date: Fri, 18 Jan 2013 19:39:53 -0800
Subject: Re: Estimating disk space requirements
To: user@hadoop.apache.org

Jeff makes some good points here.

On Fri, Jan 18, 2013 at 5:01 PM, Jeffrey Buell <jbuell@vmware.com> wrote:

> I disagree. There are some significant advantages to using "many small
> nodes" instead of "few big nodes". As Ted points out, there are some
> disadvantages as well, so you have to look at the trade-offs. But consider:
>
> - NUMA: If your Hadoop nodes span physical NUMA nodes, then performance
> will suffer from remote memory accesses. The Linux scheduler tries to
> minimize this, but I've found that about 1/3 of memory accesses are remote
> on a 2-socket machine. This effect will be more severe on bigger machines.
> Hadoop nodes that fit within a NUMA node will not access remote memory at
> all (at least on vSphere).

This is definitely a good point with respect to stock Hadoop, but with a
system like MapR there is a significant amount of core locality going on to
minimize NUMA-remote fetches. That can have a significant impact, of course.
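For a rough sense of what that remote-access fraction costs, here is a minimal
back-of-envelope sketch in Python. The local and remote latency numbers are
assumptions chosen only for illustration; the ~1/3 remote fraction is the
figure quoted above.

    # Rough effect of NUMA-remote memory accesses on average latency.
    # local_ns and remote_ns are assumed values for a 2-socket machine,
    # used only to illustrate the shape of the penalty.
    def effective_latency(remote_fraction, local_ns=80.0, remote_ns=130.0):
        return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

    spanning = effective_latency(1.0 / 3.0)  # node spans both sockets, ~1/3 remote
    pinned = effective_latency(0.0)          # node fits within one NUMA node

    print(f"spanning sockets: {spanning:.1f} ns average")
    print(f"pinned to one NUMA node: {pinned:.1f} ns average")
    print(f"average memory-latency penalty: {spanning / pinned - 1.0:.1%}")

With those assumed numbers the spanning case pays roughly a 20% average
memory-latency penalty, which is the gap that sizing a node to fit one NUMA
node avoids.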
> - Disk partitioning: Smaller nodes with fewer disks each can significantly
> increase average disk utilization, not decrease it. Having many threads
> operating against many disks in the "big node" case tends to leave some
> disks idle while others are over-subscribed.

Again, this is an implementation side effect. Good I/O scheduling and proper
striping can mitigate this substantially (see the toy utilization sketch at
the end of this message).

Going the other way, splitting disks between different VMs can be disastrous.

> Partitioning disks among nodes decreases this effect. The extreme case is
> one disk per node, where no disks will be idle as long as there is work to
> do.

Yes. Even deficient implementations should succeed in this case.

You do lose the ability to allow big-memory jobs that would otherwise span
multiple slots.

> - Management: Not a performance effect, but smaller nodes enable easier
> multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware
> with other workloads, etc.

Definitely true.
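To put a toy number on the disk-utilization point above, here is a minimal
Python sketch. The workload model, in which each concurrently busy task
touches one uniformly chosen disk per scheduling interval, is an assumption
made purely for illustration; real block placement and I/O scheduling behave
better, which is exactly the mitigation point made above.

    import random

    def idle_fraction(num_disks, busy_tasks, trials=10000):
        """Average fraction of disks that receive no work in one interval
        when each busy task touches a uniformly chosen disk."""
        idle = 0.0
        for _ in range(trials):
            hit = {random.randrange(num_disks) for _ in range(busy_tasks)}
            idle += (num_disks - len(hit)) / num_disks
        return idle / trials

    # "Big node": 12 concurrently busy tasks spread across 12 shared disks.
    print(f"12 tasks over 12 shared disks: {idle_fraction(12, 12):.0%} of disks idle")

    # Partitioned extreme: one disk per small node, so every node with work
    # keeps its single disk busy -- no idle disks as long as there is work.

Under that toy model roughly a third of the big node's disks sit idle in any
given interval, while the one-disk-per-node extreme keeps every disk busy
whenever its node has work to do.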