hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/Troubleshooting" by PatrickHunt
Date Mon, 30 Nov 2009 21:28:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ZooKeeper/Troubleshooting" page has been changed by PatrickHunt.


  The following can be a useful checklist when you are having issues with your ZK cluster, in
particular if you are seeing large numbers of timeouts, session expirations, poor performance,
or high operation latencies. Use the following on all servers and potentially on clients as well:
   * hdparm with the -t and -T options to test your disk IO
+  * time dd if=/dev/urandom bs=512000 of=/tmp/memtest count=1050 
+   * time md5sum /tmp/memtest; time md5sum /tmp/memtest; time md5sum /tmp/memtest 
+   * See ECC memory section below for more on this
   * ethtool to check the configuration of your network
   * ifconfig also to check network and examine error counts
    * ZK uses TCP for network connectivity, so errors on the NICs can cause poor performance
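
The dd/md5sum check in the list above can be wrapped in a small script so that exactly the same test runs on every server. The following is a minimal sketch, not part of the original checklist: the function names, the default path, and the output format are illustrative assumptions, and it assumes GNU coreutils (md5sum, plus a date that supports %N).

```shell
#!/bin/sh
# Sketch only: wraps the checklist's dd/md5sum commands so runs are repeatable.
# make_memtest_file and time_md5_passes are hypothetical helper names.
# Assumes GNU coreutils: md5sum, and date with %N (nanoseconds).

# Create a file of random data. The defaults match the checklist:
# 1050 blocks of 512000 bytes written to /tmp/memtest.
make_memtest_file() {
    dd if=/dev/urandom bs=512000 of="${1:-/tmp/memtest}" count="${2:-1050}" 2>/dev/null
}

# Run md5sum over the file several times (3 by default, as in the checklist)
# and print one line per pass with the elapsed wall-clock time in milliseconds.
# The body runs in a subshell so its variables do not leak into the caller.
time_md5_passes() (
    file="${1:-/tmp/memtest}"
    passes="${2:-3}"
    i=1
    while [ "$i" -le "$passes" ]; do
        start=$(date +%s%N)
        md5sum "$file" >/dev/null || exit 1
        end=$(date +%s%N)
        echo "host=$(uname -n) pass=$i elapsed_ms=$(( (end - start) / 1000000 ))"
        i=$((i + 1))
    done
)
```

For example, `make_memtest_file && time_md5_passes` reproduces the checklist run with the default sizes (remove /tmp/memtest afterwards). Because the passes are repeated, the file is most likely served from the page cache after it is written, which is presumably why the checklist ties this test to memory rather than to disk.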
@@ -79, +82 @@

  Poor disk IO will also result in increased operation latencies. Use hdparm with the -t and
-T options to verify the performance of persistent storage.
+ === Hardware - ECC memory problems can be hard to track down ===
+ I've seen a particularly nasty problem where bad ECC memory caused a single server
to run an order of magnitude slower than the rest of the servers in the cluster. This led to
seemingly random problems that were nearly impossible to track down (since the
machine kept running, just slowly). Ops replaced the ECC memory and all was fine. See the
troubleshooting checklist at the top of this page; the dd/md5sum commands listed there can
help to sniff this out (hint: compare the results on all of your servers and verify they are
at least "close").
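
The "compare the results on all of your servers" hint can be automated once the timings are collected in one place. Here is a minimal sketch, assuming one "host elapsed_ms" line per server on stdin; the function name, the input format, and the default threshold of 2x the fastest host are assumptions for illustration, not values from this page.

```shell
#!/bin/sh
# Sketch only: flag servers whose md5sum timing is far from the fastest one.
# flag_slow_hosts is a hypothetical helper; the 2x default is an arbitrary choice.

# Reads "host elapsed_ms" lines on stdin. Prints every host whose time is more
# than FACTOR times the fastest host's time (FACTOR is $1, default 2).
flag_slow_hosts() {
    awk -v factor="${1:-2}" '
        NF >= 2 {
            host[NR] = $1
            ms[NR] = $2 + 0
            # Track the smallest elapsed time seen so far.
            if (!seen || ms[NR] < min) { min = ms[NR]; seen = 1 }
        }
        END {
            for (n = 1; n <= NR; n++)
                if ((n in host) && ms[n] > factor * min)
                    printf "%s %d ms (fastest: %d ms)\n", host[n], ms[n], min
        }'
}
```

No output means all servers are within the chosen factor of each other. A server like the one described above, running an order of magnitude slower than its peers, would be listed even with a much looser factor.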
  === Virtual environments ===
  We've seen situations where users run the entire ZK cluster on a set of VMware VMs, all
on the same host system. Latency on this configuration was far greater than 10 seconds in some cases
due to resource issues (in particular IO - see the link I provided above, dedicated log devices
are critical to low latency operation of the ZK cluster). Obviously no one should be running
in this configuration in production - in particular there will be no reliability in cases
where the host storage fails!
