hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <tdunn...@maprtech.com>
Subject Re: Estimating disk space requirements
Date Sat, 19 Jan 2013 03:39:53 GMT
Jeff makes some good points here.

On Fri, Jan 18, 2013 at 5:01 PM, Jeffrey Buell <jbuell@vmware.com> wrote:

> I disagree.  There are some significant advantages to using "many small
> nodes" instead of "few big nodes".  As Ted points out, there are some
> disadvantages as well, so you have to look at the trade-offs.  But consider:
> - NUMA:  If your hadoop nodes span physical NUMA nodes, then performance
> will suffer from remote memory accesses.  The Linux scheduler tries to
> minimize this, but I've found that about 1/3 of memory accesses are remote
> on a 2-socket machine.  This effect will be more severe on bigger
> machines.  Hadoop nodes that fit on a NUMA node will have not access remote
> memory at all (at least on vSphere).

This is definitely a good point with respect to untainted Hadoop, but with
a system like MapR, there is a significant amount of core locality that
goes on to minimize NUMA-remote fetches.  This can have significant impact,
of course.

- Disk partitioning:  Smaller nodes with fewer disks each can significantly
> increase average disk utilization, not decrease it.  Having many threads
> operating against many disks in the "big node" case tends to leave some
> disks idle while others are over-subscribed.

Again, this is an implementation side-effect.  Good I/O scheduling and
proper striping can mitigate this substantially.

Going the other way, splitting disks between different VM's can be

>  Partitioning disks among nodes decreases this effect.  The extreme case
> is one disk per node, where no disks will be idle as long as there is work
> to do.

Yes.  Even deficient implementations should succeed in this case.

You do lose the ability to allow big-memory jobs that would otherwise span
multiple slots.

> - Management: Not a performance effect, but smaller nodes enable easier
> multi-tenancy, multiple virtual Hadoop clusters, sharing physical hardware
> with other workloads, etc.

Definitely true.

View raw message