hbase-user mailing list archives

From Alex Loddengaard <a...@cloudera.com>
Subject Re: hadoop hardware configuration
Date Wed, 27 May 2009 19:39:16 GMT
Answers in-line.


On Wed, May 27, 2009 at 6:50 AM, Patrick Angeles wrote:

> Hey all,
> I'm trying to find some up-to-date hardware advice for building a Hadoop
> cluster. I've only been able to dig up the following links. Given Moore's
> law, these are already out of date:
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200811.mbox/%3CA47C361B-D19B-4A61-8DC1-41D4C0975EC8@cse.unl.edu%3E
> http://wiki.apache.org/hadoop/MachineScaling
> We expect to be taking in roughly 50GB of log data per day. In the early
> going, we can choose to retain the logs for only a short period after
> processing, so we can start with a small cluster (around 6 task nodes).
> However, at some point, we will want to retain up to a year's worth of raw
> data (~14TB per year).
> We will likely be using Hive/Pig and Mahout for cluster analysis.
> Given this, I'd like to run by the following machine specs to see what
> everyone thinks:
> 2 x Hadoop Master (and Secondary NameNode)
>   - 2 x 2.3Ghz Quad Core (Low Power Opteron -- 2376 HE @ 55W)
>   - 16GB DDR2-800 Registered ECC Memory
>   - 4 x 1TB 7200rpm SATA II Drives
>   - Hardware RAID controller
>   - Redundant Power Supply
>   - Approx. 390W power draw (1.9amps 208V)
>   - Approx. $4000 per unit
> 6 x Hadoop Task Nodes
>   - 1 x 2.3Ghz Quad Core (Opteron 1356)
>   - 8GB DDR2-800 Registered ECC Memory
>   - 4 x 1TB 7200rpm SATA II Drives
>   - No RAID (JBOD)
>   - Non-Redundant Power Supply
>   - Approx. 210W power draw (1.0amps 208V)
>   - Approx. $2000 per unit

If you can swing it, I'd recommend going with eight cores and 16 GBs of
memory, unless you expect your jobs to be IO bound.  Really the ratio of
disks to CPU+RAM should be matched to the types of jobs you're running.
That said, doubling your cores and memory gives you a little more breathing
room, and is relatively cheap in the grand scheme of things.
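Patrick's numbers can be turned into a quick sizing sketch. This is my own back-of-envelope math, not from the thread, assuming the HDFS default replication factor of 3 and the 1x scratch headroom Patrick proposes below:

```python
# Rough HDFS sizing sketch for the figures in the thread.
# Assumptions (mine): replication factor of 3 (the HDFS default),
# and scratch space equal to 1x the input size, as Patrick suggests.

raw_tb_per_year = 14      # from the thread: ~14 TB of raw log data per year
replication = 3           # HDFS default replication factor
scratch_factor = 1.0      # reserve 1x the input size for MR scratch

hdfs_tb = raw_tb_per_year * replication
total_tb = hdfs_tb + raw_tb_per_year * scratch_factor

node_capacity_tb = 4      # 4 x 1 TB drives per task node
nodes_needed = -(-int(total_tb) // node_capacity_tb)  # ceiling division

print(f"Replicated HDFS footprint: {hdfs_tb} TB")
print(f"With scratch headroom:     {total_tb} TB")
print(f"4 TB task nodes required:  {nodes_needed}")
```

At a year's retention this comes out well beyond the six proposed nodes, which is worth keeping in mind when choosing per-node density.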

> I had some specific questions regarding this configuration...
>   1. Is hardware RAID necessary for the master node?

No.  Just make sure you configure Hadoop to write NN metadata to each disk,
including an NFS mount.  The NN will write in parallel to all of its
configured metadata directories.
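For reference, this is done with a comma-separated list for `dfs.name.dir`. A minimal sketch in the hadoop-site.xml layout of that era; the paths here are illustrative, not from the thread:

```xml
<!-- hadoop-site.xml on the master; paths below are examples only.
     The NN writes its metadata to every listed directory, so put
     each on a separate disk plus one NFS mount for off-box safety. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>
```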

>   2. What is a good processor-to-storage ratio for a task node with 4TB of
>   raw storage? (The config above has 1 core per 1TB of raw storage.)

Again, I would recommend doubling your cores and memory.

>   3. Am I better off using dual quads for a task node, with a higher power
>   draw? Dual quad task node with 16GB RAM and 4TB storage costs roughly
> $3200,
>   but draws almost 2x as much power. The tradeoffs are:
>      1. I will get more CPU per dollar and per watt.
>      2. I will only be able to fit half as many dual-quad machines into a
>      rack.
>      3. I will get 1/2 the storage capacity per watt.
>      4. I will get less I/O throughput overall (fewer spindles per core)
>   4. In planning storage capacity, how much spare disk space should I take
>   into account for 'scratch'? For now, I'm assuming 1x the input data size.

What do you define as scratch?  Do you mean mapper intermediate data?  If
that's what you mean, then you should assume a fair amount.  If you install
your OS on one partition, and devote all other partitions and disks to HDFS,
Hadoop will do something reasonable with regard to DFS data, MR temporary
data, etc.
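Concretely, "something reasonable" means listing every data disk in both the DFS and MapReduce local-directory properties, so blocks and intermediate data are spread across all spindles. A sketch of the relevant task-node settings; the mount points are my own illustrative examples:

```xml
<!-- hadoop-site.xml on each task node; paths are examples only.
     Listing every data disk lets Hadoop round-robin both DFS blocks
     and MR intermediate ("scratch") data across all four spindles. -->
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local</value>
</property>
```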

> Thanks in advance,
> - P
