From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "DiskSetup" by EdwardCapriolo
Date Mon, 13 Jul 2009 21:43:50 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by EdwardCapriolo:
http://wiki.apache.org/hadoop/DiskSetup

------------------------------------------------------------------------------
  
  == Hardware ==
  
- You don't need RAID disk controllers for Hadoop, as it copies data across multiple machines
instead. This increases the likelihood that there is a free task slot near that data, and, if
the servers are on different PSUs and switches, eliminates some more points of failure in
the datacenter.
+ You don't need RAID disk controllers for a Hadoop DataNode, as HDFS copies data across multiple
machines instead. This increases the likelihood that there is a free task slot near that data,
and, if the servers are on different PSUs and switches, eliminates some more points of failure
in the data center.
+ 
+ While the Hadoop NameNode and Secondary NameNode can each write to a list of drive locations,
they will stop functioning if they cannot write to ALL of those locations. In that case a
mirrored RAID is a good idea for higher availability.
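+ For example (the paths below are illustrative, not recommendations), the NameNode's metadata
directories are given as a comma-separated list in ''dfs.name.dir'' (the Secondary NameNode's
checkpoint directories in ''fs.checkpoint.dir''); pointing one entry at a mirrored volume keeps
the metadata available if a single disk dies:

{{{
<!-- hdfs-site.xml (illustrative paths): the NameNode writes its image and
     edit log to every directory listed, so one copy can live on a RAID-1
     volume. -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk0/hdfs/name,/raid1/hdfs/name</value>
</property>
}}}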
  
  Having lots of disks per server gives you more raw IO bandwidth than having one or two big
disks. If you have enough disks that different tasks can use different disks for input and
output, disk seeking, one of the big disk performance killers, is minimized. That said, more
disks draw more power; if you are power limited, you may want fewer but larger disks.
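  For example (mount points are hypothetical), giving the DataNode one directory per physical
disk in ''dfs.data.dir'' lets it spread blocks across all the spindles:

{{{
<!-- hdfs-site.xml (illustrative paths): one entry per physical disk; the
     DataNode rotates new blocks across the listed directories, so
     concurrent tasks tend to hit different spindles. -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk0/hdfs/data,/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
}}}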
  
@@ -22, +24 @@

  
  If you mount the disks with noatime, file access times aren't written back; this speeds
up reads. There is also relatime, which stores some access time information but is not as
slow as the classic atime attribute. Remember that any access time information kept by Hadoop
is independent of the atime attribute of individual blocks, so Hadoop does not care what your
settings are here. If you are mounting disks purely for Hadoop, use noatime.
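  For example (device and mount point are hypothetical), an /etc/fstab entry for a dedicated
Hadoop data disk might look like:

{{{
# /etc/fstab: noatime stops file reads from triggering inode writes
/dev/sdb1  /disk1  ext3  defaults,noatime  0 0
}}}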
  
+ Formatting and tuning options are important. Using tune2fs to set the reserved block
percentage to zero (the default reserve is 5%) can save you over 25 gigabytes on a 1 terabyte
disk. Also, since the underlying filesystem will hold mostly large files, you can reclaim more
space by lowering the number of inodes at format time.
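+ For example (the device name is hypothetical):

{{{
# Format with fewer inodes: -i sets the bytes-per-inode ratio, so a larger
# value means fewer inodes; HDFS stores large files, so few are needed.
mkfs.ext3 -i 131072 /dev/sdb1

# Hand the root-reserved blocks (5% by default) back to general use.
tune2fs -m 0 /dev/sdb1
}}}
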
  === Ext3 ===
  
  Yahoo! has publicly stated they use ext3. Regardless of the merits of the filesystem, that
means that HDFS-on-ext3 has been publicly tested at a bigger scale than any other underlying
filesystem.
