Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by SteveLoughran:
http://wiki.apache.org/hadoop/LargeClusterTips
The comment on the change is:
formatting
------------------------------------------------------------------------------
Things will go wrong. There is always a single point of failure (SPOF). Test your failure-handling processes before
you go live.
- * Simulate a corrupted edit log by killing the namenode process, truncating the (binary)
edit log, and bringing it up. See how the team handles it.
+ * Simulate a corrupted edit log by killing the namenode process, truncating the (binary)
edit log, and restarting the namenode. See how the team handles it.
- * Turn off one of the switches, pull out a network cable. See how the cluster handles it,
how it recovers. Then put the switch back on.
+ * Turn off one of the switches or pull out a network cable. See how the cluster handles it
and how it recovers. Then turn the switch back on.
- * Turn an entire rack off without warning. See what happens when they go offline.
+ * Turn an entire rack off without warning. See what happens when all of its machines go offline at once.
- * Turn off DNS.
+ * Turn off DNS, or just the reverse DNS (rDNS) side of things.
- * Turn off the entire datacenter, switch it back on. Are there any race conditions?
+ * Turn off the entire datacenter, then switch it back on. Are there any race conditions?
- * Write an job that tries to generate too much data, fills up the cluster. How is it handled?
+ * Write a job that tries to generate too much data and fills up the cluster. How is it handled?
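The corrupted-edit-log drill can be scripted so it is repeatable. The following is a minimal Python sketch. It assumes that the namenode process has already been killed and that you are working on a test cluster, never production. It first copies the edit log to a `.bak` file and then truncates the original to a fraction of its length, which mimics a write that was cut off partway through. The function name `truncate_edit_log` and its parameters are hypothetical, and the sketch does not know where your edit log lives. Point it at the edits file in your own configured name directory.

```python
import os
import shutil


def truncate_edit_log(path: str, keep_fraction: float = 0.5, backup: bool = True) -> int:
    """Hypothetical helper: chop a (binary) edit log to simulate corruption.

    Assumes the namenode is already stopped. Returns the new file size in bytes.
    """
    if not 0.0 <= keep_fraction < 1.0:
        raise ValueError("keep_fraction must be in [0, 1)")
    if backup:
        # Keep an untouched copy so the drill can be undone afterwards.
        shutil.copy2(path, path + ".bak")
    size = os.path.getsize(path)
    new_size = int(size * keep_fraction)
    # Open for in-place binary update and cut the file at new_size bytes.
    with open(path, "r+b") as f:
        f.truncate(new_size)
    return new_size
```

For example, `truncate_edit_log(path_to_edits, 0.9)` keeps the first 90% of the file. Restart the namenode afterwards and watch how it, and the team, react. The `.bak` copy exists so that you can restore the original log if recovery does not go to plan.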