hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/Troubleshooting" by PatrickHunt
Date Tue, 27 Oct 2009 20:31:07 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ZooKeeper/Troubleshooting" page has been changed by PatrickHunt.


  It is important to monitor the ZK environment (hardware, network, processes, etc.) in
order to troubleshoot problems more easily. Otherwise you miss out on important information
for determining the cause of the problem. What type of monitoring are you doing on your cluster?
You can monitor at the host level, which gives you some insight into where to look: CPU,
memory, disk, network, etc. You can also monitor at the process level: the ZooKeeper server
JMX interface gives you information about latencies and such (you can also use the [[http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands|four
letter words]] for that if you want to hack up some scripts instead of using JMX). JMX
also gives you insight into the workings of the JVM, so you could, for example, confirm or rule out GC
pauses causing the JVM's Java threads to hang for long periods of time (see below).
  Without monitoring, troubleshooting will be more difficult, but not impossible. You can use JMX
through jconsole or access the stats through the four letter words; the log4j log also
contains much useful information.
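
As an illustration of scripting against the four letter words instead of JMX, here is a minimal Python sketch. It assumes the server listens on the default client port 2181 on localhost and that the `stat` reply contains a `Latency min/avg/max:` line; the helper names `four_letter_word` and `parse_latency` are hypothetical, not part of ZooKeeper. Adjust host, port and parsing to your deployment.

```python
import re
import socket


def four_letter_word(host, port, word, timeout=5.0):
    """Send a four letter word (e.g. "stat" or "ruok") and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_output):
    """Return (min, avg, max) from the latency line of `stat` output, or None."""
    match = re.search(
        r"Latency min/avg/max:\s*(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)",
        stat_output,
    )
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())


if __name__ == "__main__":
    # Assumed address: a ZooKeeper server on localhost, default client port.
    reply = four_letter_word("localhost", 2181, "stat")
    print(parse_latency(reply))
```

Running this periodically against every server in the ensemble and recording the result gives a rough latency history even when no JMX tooling is in place.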
+ == Troubleshooting Checklist ==
+ The following can be a useful checklist when you are having issues with your ZK cluster, in
particular if you are seeing large numbers of timeouts, session expirations, poor performance,
or high operation latencies. Use the following on all servers and potentially on clients as well:
+  * hdparm with the -t and -T options to test your disk IO
+  * ethtool to check the configuration of your network
+  * ifconfig, also to check the network and examine error counts
+   * ZK uses TCP for network connectivity; errors on the NICs can cause poor performance
+  * scp, ftp, etc. can be used to verify connectivity; try copying large files between nodes
+  * [[http://github.com/phunt/zk-smoketest#readme|these]] smoke and latency tests can be
useful to verify a cluster
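
The NIC error-count check in the list above can also be scripted. The following is a rough Python sketch that assumes a Linux host, where the kernel exposes per-interface counters in `/proc/net/dev` (this file is an assumption of the sketch, not something the checklist itself names); the function names are hypothetical. It reports interfaces whose receive or transmit error or drop counters are non-zero.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev style text into per-interface error and drop counters."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:  # the two header lines contain no colon
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue
        values = [int(field) for field in fields]
        # Receive columns: bytes packets errs drop ... (indices 0-7)
        # Transmit columns: bytes packets errs drop ... (indices 8-15)
        counters[name.strip()] = {
            "rx_errs": values[2],
            "rx_drop": values[3],
            "tx_errs": values[10],
            "tx_drop": values[11],
        }
    return counters


def interfaces_with_errors(counters):
    """Return the sorted names of interfaces with any non-zero counter."""
    return sorted(
        name for name, stats in counters.items() if any(stats.values())
    )


if __name__ == "__main__":
    with open("/proc/net/dev") as handle:  # Linux only
        stats = parse_net_dev(handle.read())
    for name in interfaces_with_errors(stats):
        print(name, stats[name])
```

The counters are cumulative since boot, so a non-zero value is only a hint; comparing two readings taken some time apart shows whether errors are still accumulating.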
+ See the [[ZooKeeper/ServiceLatencyOverview|Latency Overview]] page for some latency baselines.
  == A word or two about heartbeats ==
