hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/Troubleshooting" by PatrickHunt
Date Thu, 16 Apr 2009 20:22:10 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by PatrickHunt:

New page:
= Troubleshooting ZooKeeper Operating Environment =

This page describes specific problems people have seen, the solutions (where one was found),
and the steps taken to troubleshoot each issue. Feel free to add your own experiences.

== Monitoring ==

It is important to monitor the ZK environment (hardware, network, processes, etc.) so that
problems are easier to troubleshoot. Otherwise you miss out on information needed to
determine the cause of the problem. What type of monitoring are you doing on your cluster?
You can monitor at the host level, which gives you some insight into where to look: cpu,
memory, disk, network, etc. You can also monitor at the process level: the ZooKeeper server
JMX interface gives you information about latencies and similar statistics (you can also use the
[http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands four letter words]
for that if you would rather hack up some scripts than use JMX). JMX also gives you insight
into the workings of the JVM, so for example you could confirm or rule out GC pauses causing
the Java threads to hang for long periods of time (see below).

Without monitoring, troubleshooting will be more difficult, but not impossible. JMX can be
used through jconsole, the same stats are available through the four letter words, and the
log4j log contains much useful information.
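
As a rough illustration of the scripting route mentioned above, here is a minimal Python sketch that sends a four letter word to a server and pulls the latency figures out of a `stat` reply. The function names are made up for this example, the host and port in the usage note are placeholders, and the parser assumes the reply contains a line of the form `Latency min/avg/max: a/b/c`; check the output of your own server version before relying on it.

```python
import re
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a four letter word (e.g. "stat") and return the server's text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection once it has replied
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_output):
    """Return (min, avg, max) from the "Latency min/avg/max" line, or None if absent."""
    match = re.search(
        r"Latency min/avg/max:\s*([\d.]+)/([\d.]+)/([\d.]+)", stat_output
    )
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())
```

Calling `parse_latency(four_letter_word("zk1.example.com", 2181, "stat"))` from a cron job or a monitoring agent would be one way to track server latency over time; the hostname here is hypothetical.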

== A word or two about heartbeats ==

Keep in mind that the session timeout period is used by both the client and the server. If
the ZK leader doesn't hear from the client within the timeout (say it's 5 sec) it will expire
the session. The client sends a ping after 1/3 of the timeout period. It expects to hear
a response before another 1/3 of the timeout elapses, after which it will attempt to re-sync
to another server in the cluster. In the 5 sec timeout case you are allowing roughly 1.7 seconds
(one third of the timeout) for the request to go to the server, the server to respond back to
the client, and the client to process the response. Check the latencies in ZK's JMX in order
to get insight into this: if the server latency is high, say because of io issues, jvm swapping,
vm latency, etc., clients will time out and sessions will expire.
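
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the thirds described in this section, not the client's actual implementation; the function and key names are invented for the example, and treating the final third as time left to reach another server is an inference from the description rather than something stated here.

```python
def heartbeat_budget(session_timeout_s):
    """Split a session timeout into the three thirds described above."""
    third = session_timeout_s / 3.0
    return {
        # The client sends a ping after the first third of the timeout.
        "ping_after_s": third,
        # It expects the reply before another third has elapsed.
        "response_window_s": third,
        # Assumption: what remains is the time left to re-sync to another
        # server before the session is expired.
        "resync_window_s": session_timeout_s - 2 * third,
    }


def round_trip_fits(observed_round_trip_s, session_timeout_s):
    """True if an observed request/response time fits in the response window."""
    budget = heartbeat_budget(session_timeout_s)
    return observed_round_trip_s < budget["response_window_s"]
```

Comparing the maximum latency reported by the server against `response_window_s` gives a quick indication of whether a given timeout leaves any headroom.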

== Frequent client disconnects & session expirations ==

ZooKeeper is a canary in a coal mine of sorts. Because of the heart-beating performed by the
clients and servers, ZooKeeper-based applications are very sensitive to things like network
and system latencies. We often see client disconnects and session expirations associated with
these types of problems.

Take a look at [http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_commonProblems
this section] to start.

=== Hardware misconfiguration - NIC ===

In one case a cluster of 5k ZK clients was attaching to a ZK cluster, and ~20% of the clients
had misconfigured NICs. This caused high TCP packet loss (and therefore high network
latency), which caused disconnects (timeout exceeded), but only under fairly high network
load (which made it hard to track down!). In the end special processes were set up to continuously
monitor client-server network latency. Any spikes in the observed latencies were then correlated
with the ZK logs (timeouts), and eventually all of the NICs on these hosts were reconfigured.
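
The correlation step described above might look something like the following Python sketch. It is not the tooling that was actually used in this case. It assumes you already have latency samples as (timestamp, latency) pairs and the timestamps of timeouts extracted from the ZK logs, both in seconds; the function names, threshold and window are hypothetical.

```python
def find_spikes(samples, threshold_s):
    """Return the timestamps of samples whose latency exceeds threshold_s.

    samples is a list of (timestamp_s, latency_s) pairs.
    """
    return [timestamp for timestamp, latency in samples if latency > threshold_s]


def correlate(spike_times, timeout_times, window_s):
    """Map each logged timeout to the latency spikes within window_s seconds of it."""
    return {
        timeout: [spike for spike in spike_times if abs(timeout - spike) <= window_s]
        for timeout in timeout_times
    }
```

Timeouts that map to an empty list were not accompanied by a measured latency spike and probably have some other cause.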

=== Hardware - network switch ===

Another issue with the same user as the NIC issue - a cluster of 5k ZK clients attaching to
a ZK cluster. It turned out that the network switches had bad firmware which caused high packet
latencies under heavy load. At certain times of day we would see high numbers of ZK client
disconnects. It turned out that these were periods of heavy network activity, exacerbated
by the ZK client session expirations (they caused even more network traffic). In the end the
operations team spent a number of days testing/loading the network infrastructure until they
were able to pin down the issue as being switch related. The switch firmware was upgraded
and this issue was eventually resolved.

=== Virtual environments ===

We've seen situations where users run the entire ZK cluster on a set of VMware VMs, all on
the same host system. Latency on this configuration was far greater than 10 sec in some cases
due to resource issues (in particular io - see the link provided above; dedicated log devices
are critical to low latency operation of the ZK cluster). Obviously no one should be running
in this configuration in production - in particular there will be no reliability in cases
where the host storage fails!

=== Virtual environments - "Cloud Computing" ===

In one scenario involving EC2, ZK was seeing frequent client disconnects. The user had configured
a timeout of 5 seconds, which is too low, probably much too low. Why? You are running in virtualized
environments on non-dedicated hardware outside your control/inspection. There is typically
no way to tell (unless you are running on the 8 core ec2 systems) if the ec2 host you are
running on is over/under subscribed (other vms). There is no way to control disk latency either.
You could be seeing large latencies due to resource contention on the ec2 host alone. In addition
to that, I've heard that network latencies in ec2 are high relative to what you would see if
you were running in your own dedicated environment. Without measuring it, it's hard to tell
what latency you are seeing between the servers, and between client and server, within the
ec2 environment.
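
One simple way to get such a measurement is to time TCP connections to the server port from a client host. The Python sketch below is only a starting point under that assumption: connect time is a rough proxy for network latency, not a measure of ZooKeeper request latency, and the function names and the hostname in the usage note are made up for the example.

```python
import socket
import statistics
import time


def tcp_connect_latency(host, port, timeout=5.0):
    """Return the seconds taken to open (and then close) one TCP connection."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return time.monotonic() - start


def summarize(latencies_s):
    """Return the min, median and max of a non-empty list of latency samples."""
    return {
        "min_s": min(latencies_s),
        "median_s": statistics.median(latencies_s),
        "max_s": max(latencies_s),
    }
```

For example, `summarize([tcp_connect_latency("zk1.example.com", 2181) for _ in range(20)])` run periodically from a client host gives a picture of how the latency varies over the day.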

=== GC pressure ===

The Java GC can cause [https://issues.apache.org/jira/browse/HBASE-1316 starvation of the
Java threads] in the VM. This manifests itself as client disconnects and session expirations
due to starvation of the heartbeat thread. The GC runs, locking out all Java threads from
running.

This issue can be resolved in a few ways:

First look at using one of the alternative GCs, in particular [http://developer.amd.com/documentation/articles/pages/4EasyWaystodoJavaGarbageCollectionTuning.aspx
low latency GC]:

e.g. the following JVM options: -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC

Secondly, you might try the solution used by HBase: spawn a non-Java (JNI) thread to manage
your ephemeral znodes. This is a pretty advanced option, however; try the alternative GC first
and see if that helps.
