hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "MachineScaling" by cfellows
Date Fri, 14 Dec 2007 15:47:34 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by cfellows:

- == Machine Scaling ==
+ Beyond the software questions involved in setting up and running Hadoop, there are a few
other questions that relate to hardware scaling:
+  1. What are the optimum machine configurations for running a Hadoop cluster?
+  1. Should I use a smaller number of high-end/high-performance machines or a larger number
of "commodity" machines?
+  1. How does the Hadoop/Parallel Distributed Processing community define "commodity"?
+ '''Note:''' The initial section of this page will focus on datanodes.
+ In answer to 1 and 2 above, we can group the possible hardware options into 3 rough categories:
+  A. database-class machine with many (>10) fast SAS drives, >10GB memory, and dual
or quad x quad-core CPUs. Approximate cost is $20K.
+  A. generic production machine with 2 x 250GB SATA drives, 4-12GB RAM, dual x dual-core
CPUs (e.g. a Dell 1950). Cost is about $2-5K.
+  A. POS beige box machine with 2 x SATA drives of variable size, 4GB RAM, single dual-core
CPU. Cost is about $1K.
+ For a $50K budget, most users would take 25x(B) over 50x(C) because fewer machines mean
simpler administration, even though cost/performance would be nominally about the same.
Most users would avoid 2x(A) like the plague.
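The machine counts quoted for the $50K budget can be checked with a little arithmetic. The
sketch below is only an illustration: the prices are the approximate figures from the
categories above, (B) is assumed to sit at the low end of its $2-5K range (which is what
the 25x figure implies), and the names PRICES and machines_for_budget are hypothetical.

```python
# Approximate per-machine prices in USD, taken from the categories above.
# Assumption: (B) is priced at the low end of its $2-5K range, which is
# what the "25x(B) for $50K" figure implies.
PRICES = {"A": 20_000, "B": 2_000, "C": 1_000}


def machines_for_budget(budget, prices=PRICES):
    """Return how many whole machines of each class the budget buys."""
    return {name: budget // price for name, price in prices.items()}


counts = machines_for_budget(50_000)
for name, n in counts.items():
    print(f"{n}x({name})")
```

At the high end of the (B) range ($5K per machine) the same budget would buy only 10
machines, so the comparison is sensitive to where in that range a given purchase falls.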
+ In answer to 3, "commodity" hardware is best defined as consisting of standardized,
easily available components which can be purchased from multiple distributors/retailers. Given
this definition, there is still a range of quality that can be purchased for your cluster.
As mentioned above, users generally avoid the low-end, cheap solutions. The primary motivation
for avoiding low-end solutions is "real" cost: cheap parts mean a greater number of failures,
which require more maintenance and therefore more cost. Many users spend $2K-$5K per machine.
For a longer discussion of "scaling out" see: http://jcole.us/blog/archives/2007/06/10/scaling-out-and-up-a-compromise/
+ '''More specifics:'''
+ Hadoop benefits greatly from ECC memory, which is not low-end. Multi-core boxes tend to
give more computation per dollar, per watt and per unit of operational maintenance. But the
highest-clock-rate processors tend not to be cost-effective, and neither do the very largest
drives. So moderately high-end commodity hardware is the most cost-effective for Hadoop today.
+ Some users use cast-off machines that were not reliable enough for other applications. These
machines originally cost about 2/3 what normal production boxes cost and achieve almost exactly
1/2 as much. Production boxes are typically dual-CPU machines with dual cores.
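To make the cast-off comparison concrete, the two ratios above can be combined into work
delivered per dollar relative to a production box. This is a rough sketch that treats
"achieve 1/2 as much" as half the useful output and ignores maintenance costs; the function
name relative_value is hypothetical.

```python
from fractions import Fraction


def relative_value(cost_ratio, output_ratio):
    """Work delivered per dollar, relative to a baseline machine."""
    return output_ratio / cost_ratio


# Cast-off machine: about 2/3 the cost and about 1/2 the output of a
# normal production box (figures from the paragraph above).
castoff = relative_value(Fraction(2, 3), Fraction(1, 2))
print(castoff)  # 3/4
```

Under these assumptions a cast-off machine delivers about three quarters of the work per
dollar of a production box, before any extra administration effort is counted.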
+ '''RAM:'''
+ Many users find that most Hadoop applications consume very little memory. Users
tend to have 4-8GB machines, with 2GB probably being too little.
