hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "DiskSetup" by SteveLoughran
Date Wed, 20 May 2009 13:02:56 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:

The comment on the change is:
new page on disks for Hadoop.

New page:
= Setting up Disks for Hadoop =

Here are some recommendations for setting up disks in a Hadoop cluster. What we have here
is anecdotal; hard evidence is very welcome, and everyone should expect a bit of trial and
error work.

== Key Points ==

A Hadoop cluster is normally expected to store massive amounts of data and read and write it
with high I/O bandwidth. Your MapReduce jobs may be I/O bound or CPU/memory bound; if you know
which one dominates (effectively, how many CPU cycles and MB of RAM are used per map or reduce
task), you can make better decisions.

== Hardware ==

You don't need RAID disk controllers for Hadoop, as it copies data across multiple machines
instead. This increases the likelihood that there is a free task slot near that data, and if
the servers are on different PSUs and switches, eliminates some more points of failure in
the datacenter.

Having lots of disks per server gives you more raw IO bandwidth than having one or two big
disks. If you have enough that different tasks can be using different disks for input and
output, disk seeking is minimized, which is one of the big disk performance killers. That
said: more disks have a higher power budget; if you are power limited, you may want fewer
but larger disks.

== Configuring Hadoop ==

Pass a list of directories, one per disk, to the dfs.data.dir parameter; Hadoop will use all
of the disks that are listed, spreading its data blocks across them.
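As a sketch, the entry in your Hadoop configuration file might look like this (the directory paths here are illustrative; use the mount points of your own data disks):

```
<property>
  <name>dfs.data.dir</name>
  <value>/grid/0/hdfs/data,/grid/1/hdfs/data,/grid/2/hdfs/data</value>
</property>
```

Note that the value is a single comma-separated list, not multiple property entries.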

== Underlying File System ==

=== Ext3 ===

It's widely believed that Yahoo! use ext3. Regardless of the merits of the filesystem, that
means that HDFS-on-ext3 has been publicly tested at a bigger scale than any other underlying
filesystem.
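By default ext3 reserves a percentage of each disk for root (the "reserved space settings" mentioned in the XFS discussion below); for a dedicated data disk you may want to reclaim that space. A minimal sketch, run here against a scratch file so it is safe to try; on a real node you would point mkfs and tune2fs at the disk partition itself (e.g. /dev/sdb1, an illustrative name):

```shell
# create a small scratch file to stand in for a data disk
truncate -s 64M /tmp/hdfs-disk.img

# format it as ext3; -F forces mkfs to operate on a regular file
mkfs.ext3 -F -q /tmp/hdfs-disk.img

# set the reserved-block percentage to 0 so HDFS can use the whole disk
tune2fs -m 0 /tmp/hdfs-disk.img
```

The same `tune2fs -m 0` can be applied to an already-formatted disk without reformatting it.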

=== XFS ===

From Bryan on the core-user list on 19 May 2009:

 We use XFS for our data drives, and we've had somewhat mixed results. One of the biggest
pros is that XFS has more free space than ext3, even with the reserved space settings turned
all the way to 0. Another is that you can format a 1TB drive as XFS in about 0 seconds, versus
minutes for ext3. This makes it really fast to kickstart our worker nodes.

 We have seen some weird stuff happen though when machines run out of memory, apparently because
the XFS driver does something odd with kernel memory. When this happens, we end up having
to do some fscking before we can get that node back online.

 As far as outright performance, I actually *did* do some tests of xfs vs ext3 performance
on our cluster. If you just look at a single machine's local disk speed, you can write and
read noticeably faster when using XFS instead of ext3. However, the reality is that this extra
disk performance won't have much of an effect on your overall job completion performance,
since you will find yourself network bottlenecked well in advance of even ext3's performance.

 The long and short of it is that we use XFS to speed up our new machine deployment, and that's
about it.
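If you do pick XFS for data disks, a sketch of a mount entry for one of them (the device name and mount point are illustrative; the noatime/nodiratime options are a common Hadoop tuning to avoid an access-time write on every read, not something from the quoted post):

```
# /etc/fstab entry for one Hadoop data disk
/dev/sdb1  /grid/0  xfs  noatime,nodiratime  0 0
```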
