hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "LargeClusterTips" by SteveLoughran
Date Wed, 24 Jun 2009 10:13:03 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/LargeClusterTips

The comment on the change is:
more ideas, including things to test before you go live

------------------------------------------------------------------------------
  
   * Have a good sysadmin if you're not one yourself.
   * Take a look at a presentation done by Allen Wittenauer from Yahoo!: http://tinyurl.com/5foamm
-  * Have the LAN closed off to untrusted users. This simplifies security.
+  * Have the LAN closed off to untrusted users. Without this, your filesystem is effectively open to everyone on the network.
+  * Once you are on the private LAN, turn off all firewalls on the machines; they only create connectivity problems.
   * Use LDAP or similar to manage user accounts.
   * Only put the slaves file on your namenode and secondary namenode to prevent confusion.
+ 
-  * Have identical hardware on all machines in the cluster, eliminating the need to have different
-    configuration options (task slots, data directory locations, etc)
   * Use RPMs to install the Hadoop binaries. Self:Cloudera provide some RPMs for this, and a web site to generate configuration RPM files.
   * Use kickstart or similar to bring up the machines. 
-  * Consider a system configuration management package to keep Hadoop's source and configuration consistent across all nodes.  Some example packages are bcfg2, smartfrog, puppet, cfengine, etc.
   * If you are trying to configure the machines one by one, step away from the keyboard. That is not the way to manage a cluster.
+  * Keep an eye out for disk SMART messages in the server logs. They warn of trouble.
+  * Keep an eye on disk capacity, especially on the namenode. You do not want the NN to run out of storage, as Bad Things happen when it does.
+  * Keep the underlying software in sync: OS, Java version. 
+  * Run the rebalancer, throttled back appropriately for your bandwidth (see the sketch below).
+ 
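+ A minimal sketch of a throttled rebalance, assuming a 0.20-era installation. The bandwidth cap comes from dfs.balance.bandwidthPerSec (bytes/second) in the site configuration; the threshold is how far, in percent, a datanode may deviate from the cluster-average utilization before the balancer stops:
+ {{{
+ # cap balancing traffic at ~10MB/s by setting, in hdfs-site.xml:
+ #   dfs.balance.bandwidthPerSec = 10485760
+ # then run the balancer until every datanode is within 5% of the average
+ bin/hadoop balancer -threshold 5
+ }}}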
  
  See the Self:AmazonEC2 and AmazonS3 pages for tips on managing clusters built on EC2 and S3.
  
  Other good documentation: [http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment Patterns of Hadoop Deployment]
  
+ == Hadoop Configuration ==
+ 
+  * Don't do it by copying XML files around by hand.
+  * Look at the Cloudera configuration tools. If you use them, keep the previous RPMs around so you can roll back.
+  * Consider a system configuration management package to keep Hadoop's source and configuration consistent across all nodes. Some example packages are bcfg2, SmartFrog, Puppet, cfengine, etc.
+  * Keep your site XML files under SCM, so you can roll back and diff changes (see the sketch below).
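+ 
+ A minimal sketch using Subversion and the 0.20-style split site files; any SCM gives you the same diff-and-rollback ability:
+ {{{
+ svn add conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml
+ svn commit -m "configuration as deployed to the cluster"
+ # before pushing a change out, see exactly what is different
+ svn diff conf/hdfs-site.xml
+ # undo the last bad change in the working copy, then commit the revert
+ svn merge -r COMMITTED:PREV conf/hdfs-site.xml
+ }}}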
+ 
+ 
+ == NameNode Health ==
+ 
+ The NameNode is a SPOF. When it goes offline, the cluster goes down. If it loses its data, the filesystem is gone. Value it.
+  * Have a secondary namenode! When the BackupNode replaces it, have a BackupNode!
+  * Never let its disks fill up.
+  * Consider RAID storage here. If not, set it to save its data to two independent drives, ideally on separate controllers (just in case the controller decides to play up).
+  * Set the NN up to save one copy of all its data to a remote machine (NFS?), so even if the NN goes down, you can bring up a new machine with the same hostname for everything else to bind to. A sketch of the relevant setting follows this list.
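+ 
+ A minimal sketch of the storage layout, with hypothetical mount points. dfs.name.dir takes a comma-separated list, and the NameNode writes a full copy of its image and edit log to every directory listed; the property lives in conf/hdfs-site.xml on 0.20, conf/hadoop-site.xml on earlier releases:
+ {{{
+ # two local disks on separate controllers, plus an NFS-mounted remote copy:
+ #   <property>
+ #     <name>dfs.name.dir</name>
+ #     <value>/disk1/hdfs/name,/disk2/hdfs/name,/mnt/nn-backup/name</value>
+ #   </property>
+ # sanity-check that the NFS mount really is writable before starting the NN
+ touch /mnt/nn-backup/name/probe && rm /mnt/nn-backup/name/probe
+ }}}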
+ 
+ 
+ == Workers ==
+ 
+  * Have identical hardware on all workers in the cluster, eliminating the need to have different configuration options (task slots, data directory locations, etc.)
+  * Have a common user account on every machine you run Hadoop on, with a common public key in ~/.ssh/authorized_keys.
+  * Track HDDs, their history and their failures. Disk failures are not always independent in a large datacentre.
+  * Have simple hostname to rack or IP to rack mappings, so the rack detection scripts are trivial (a sketch follows this list).
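+ 
+ A minimal sketch of a rack-awareness script, assuming hostnames that carry the rack number (worker-03-17 lives in rack 3); point topology.script.file.name at it in the site configuration. Hadoop invokes the script with one or more hosts as arguments and expects one rack path printed per argument; note it may hand you IP addresses rather than hostnames, in which case map on IPs instead:
+ {{{
+ #!/bin/sh
+ # print /rackNN for worker-NN-* hosts, and a default rack for anything else
+ for host in "$@"; do
+   case "$host" in
+     worker-*) echo "/rack$(echo "$host" | cut -d- -f2)" ;;
+     *)        echo "/default-rack" ;;
+   esac
+ done
+ }}}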
+ 
+ === How to rebalance a full datanode ===
+ 
+ If a datanode is at or near 100% capacity:
+  1. Decommission the node: this will copy everything off (see the sketch below).
+  2. Take it offline.
+  3. Delete the data, clean up the HDDs.
+  4. Add the node again. 
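+ 
+ A minimal sketch of the decommission step, with a hypothetical hostname and exclude-file path; it assumes dfs.hosts.exclude already points at the exclude file:
+ {{{
+ # add the full datanode to the exclude file and tell the NN to re-read it
+ echo worker-03-17.example.com >> /etc/hadoop/conf/dfs.exclude
+ bin/hadoop dfsadmin -refreshNodes
+ # watch the node's Decommission Status before taking it offline
+ bin/hadoop dfsadmin -report
+ }}}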
+ 
+ == Testing Failure ==
+ 
+ Things will go wrong. There is always a SPOF. Test your failure handling processes before you go live.
+ 
+  * Simulate a corrupted edit log by killing the namenode process, truncating the (binary) edit log, and bringing it back up (a sketch follows this list). See how the team handles it.
+  * Turn off one of the switches, pull out a network cable. See how the cluster handles it, how it recovers. Then turn the switch back on.
+  * Turn an entire rack off without warning. See what happens when its nodes go offline.
+  * Turn off DNS.
+  * Turn off the entire datacenter, switch it back on. Are there any race conditions?
+  * Write a job that tries to generate too much data and fills up the cluster. How is it handled?
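+ 
+ A destructive sketch of the edit-log test, for a staging cluster only; the paths assume dfs.name.dir=/disk1/hdfs/name, a hadoop user, and the default pid directory of /tmp:
+ {{{
+ kill -9 $(cat /tmp/hadoop-hadoop-namenode.pid)   # kill the NN uncleanly
+ EDITS=/disk1/hdfs/name/current/edits
+ SIZE=$(stat -c%s "$EDITS")
+ # lop 100 bytes off the end of the binary edit log
+ dd if=/dev/null of="$EDITS" bs=1 seek=$((SIZE - 100))
+ bin/start-dfs.sh    # bring HDFS back up, then watch how the team copes
+ }}}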
+ 
