hadoop-common-user mailing list archives

From Allen Wittenauer <awittena...@linkedin.com>
Subject Re: Cluster Machines
Date Tue, 03 Nov 2009 18:20:25 GMT

On 11/3/09 5:25 AM, "John Martyniak" <john@beforedawnsolutions.com> wrote:
> 1) Should each node have RAID 1, or is it sufficient to have HDFS take
> care of that?  Because for each node I could put a 80 GB drive for the
> boot drive and leave one of the 1.5 TBs for the data drive that Hadoop
> uses.

HDFS will take care of the data fidelity assuming reasonable values of
replication are configured.
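
As a rough sketch of what the replication factor costs in capacity: every
block is stored that many times, so logical space is roughly raw space
divided by the factor.  The four 1.5 TB nodes and the factor of 3 below are
illustrative assumptions, not figures from this thread, and the function
name is hypothetical.

```python
def usable_hdfs_capacity(raw_tb_per_node, replication):
    """Approximate logical HDFS capacity in TB.

    Each block is stored `replication` times, so logical capacity is
    roughly total raw capacity divided by the replication factor.
    Ignores non-DFS usage and reserved space.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return sum(raw_tb_per_node) / replication


if __name__ == "__main__":
    # Assumed example: four nodes with one 1.5 TB data drive each.
    nodes = [1.5, 1.5, 1.5, 1.5]
    print(usable_hdfs_capacity(nodes, 3))  # 6.0 TB raw -> 2.0 TB logical
```

With replication left at 1, losing a single drive loses data, which is the
case where RAID 1 on the node would actually matter.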

The reality, though, is that you probably want to use the boot drive for data
storage as well.  You'll never do an OS upgrade on a compute node; you're
going to blow it away completely.  So the old ops adage about separating OS
and application doesn't really apply.

>  2) As I grow the system is it necessary to have all nodes with the
> same config?
> Is there any benefit or problem either way.  The way
> that I have been approaching it, is to get nodes that I can get the
> best deal on that have decent performance.  So if future boxes have
> Dual or Quad core, will that cause some problem, management or
> otherwise.

The biggest issues revolve around Hadoop configuration itself.  It is fairly
easy to push out one config to all of your compute nodes.  As the number of
different configurations multiplies (and you'll likely end up with at least 3
off the bat for simple configs: NN/JT, DN/TT and client), this gets
harder and harder to maintain.  [... and Hadoop doesn't lift a finger to
help, really.]
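
One way to keep the per-role configs from drifting is to generate them from
a single shared base plus small per-role overrides.  The following is a
minimal sketch of that idea, not something Hadoop provides: the host names,
ports, paths, and role keys are made up, and it writes every property into
one file for brevity.

```python
from xml.sax.saxutils import escape

# Settings shared by every machine. Host names and ports are hypothetical.
BASE = {
    "fs.default.name": "hdfs://nn.example.com:8020",
    "mapred.job.tracker": "jt.example.com:8021",
    "dfs.replication": "3",
}

# Only the differences per role. Paths are hypothetical.
ROLE_OVERRIDES = {
    "nn_jt": {"dfs.name.dir": "/hadoop/name"},
    "dn_tt": {
        "dfs.data.dir": "/hadoop/data",
        "mapred.local.dir": "/hadoop/mapred",
    },
    "client": {},
}


def build_config(role):
    """Return the shared base with one role's overrides applied."""
    merged = dict(BASE)
    merged.update(ROLE_OVERRIDES[role])
    return merged


def to_hadoop_xml(props):
    """Render a property dict in Hadoop's <configuration> XML layout."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name in sorted(props):
        lines.append("  <property>")
        lines.append("    <name>%s</name>" % escape(name))
        lines.append("    <value>%s</value>" % escape(props[name]))
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


if __name__ == "__main__":
    for role in sorted(ROLE_OVERRIDES):
        print("## %s" % role)
        print(to_hadoop_xml(build_config(role)))
```

Adding a fourth hardware type then means adding one small override entry
rather than hand-maintaining another full copy of the config.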

> 3) For the Hard disk sizes if some of the boxes have 1.5 TBs and other
> boxes have say 300 GB, will HDFS have an issue managing that?

No, but you will. :)

The biggest problem isn't HDFS as much as it is MapReduce.  MR can spill to
disk without bounds.  This means that if the jobs themselves aren't well
behaved/well tuned, you may end up having MR eat all the space that was
capacity planned for the HDFS.

Here at LI, we're using ZFS, so we just created a dedicated MR file system
with a fixed size from each of our disk pools, then configured MR to use
that.  If the size we picked is too big, we can always reset the quota and
return that space to the HDFS.  So while user jobs may blow each other up,
at least the HDFS will be safe.  In a non-pooled storage world, you lack a bit
of the flexibility, but the concept is the same.  Just create a dedicated
slice for MR and be on your way. :)
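
To put rough numbers on the mixed-disk case, here is a small sketch of what
a fixed MR slice leaves for HDFS on each kind of box.  The 100 GB slice size
is an assumption picked for illustration, the 1.5 TB drive is treated as
1500 GB, and the function name is hypothetical.

```python
def plan_node(disk_gb, mr_slice_gb):
    """Split one node's disk into a capped MR slice and HDFS space."""
    if mr_slice_gb < 0 or mr_slice_gb > disk_gb:
        raise ValueError("MR slice must fit on the disk")
    return {
        "hdfs_gb": disk_gb - mr_slice_gb,
        "mr_gb": mr_slice_gb,
        "mr_fraction": mr_slice_gb / disk_gb,
    }


if __name__ == "__main__":
    # Disk sizes from the question; the 100 GB slice is an assumed value.
    for disk_gb in (1500, 300):
        plan = plan_node(disk_gb, 100)
        print("%4d GB disk: %4d GB for HDFS, MR slice is %.0f%% of the disk"
              % (disk_gb, plan["hdfs_gb"], plan["mr_fraction"] * 100))
```

The same slice is a small share of the big drive and a third of the small
one, so on mixed hardware the slice size likely needs to be chosen per disk
class rather than cluster-wide.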
