hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/Troubleshooting" by PatrickHunt
Date Tue, 27 Oct 2009 20:24:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "ZooKeeper/Troubleshooting" page has been changed by PatrickHunt.


  Another issue with the same user as the NIC issue - a cluster of 5k ZK clients attaching
to a ZK cluster. It turned out that the network switches had bad firmware which caused high
packet latencies under heavy load. At certain times of day we would see high numbers of ZK
client disconnects. It turned out that these were periods of heavy network activity, exacerbated
by the ZK client session expirations (they caused even more network traffic). In the end the
operations team spent a number of days testing/loading the network infrastructure until they
were able to pin down the issue as being switch related. The switch firmware was upgraded
and this issue was eventually resolved.
+ === Hardware - ifconfig is your friend ===
+ In a recent issue we saw extremely poor performance from a 3 server ZK ensemble (cluster).
Average and max latencies on operations, as reported by the "stat" command on the servers, were
very high (multiple seconds). It turned out that one of the servers had a NIC that was dropping
large numbers of packets due to framing problems. Replacing that server with another (no
NIC issue) resolved the problem. The odd part was that SSH, SCP, ping, etc. reported no problems.
+ Moral of the story: use ifconfig to verify the network interface if you are seeing issues
on the cluster.
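
The kind of check described above can be sketched in code. The example below is a minimal illustration, assuming a Linux host: instead of calling ifconfig it reads the kernel's /proc/net/dev table, which exposes per-interface receive counters of the same kind (errors, drops, frame errors). The function names parse_rx_counters and suspicious_interfaces are hypothetical helpers, not part of ZooKeeper or any existing tool.

```python
from pathlib import Path

# Receive-side columns of /proc/net/dev, in the order the Linux kernel prints them.
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo",
             "frame", "compressed", "multicast")


def parse_rx_counters(text):
    """Return {interface: {field: count}} for the receive counters."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, _, rest = line.partition(":")
        values = rest.split()
        counters[name.strip()] = dict(zip(RX_FIELDS, map(int, values[:8])))
    return counters


def suspicious_interfaces(counters):
    """List interfaces with any receive errors, drops, or framing errors."""
    return [name for name, c in counters.items()
            if c["errs"] or c["drop"] or c["frame"]]


if __name__ == "__main__":
    proc = Path("/proc/net/dev")
    if proc.exists():  # only present on Linux
        stats = parse_rx_counters(proc.read_text())
        for iface in suspicious_interfaces(stats):
            print(iface, stats[iface])
```

A non-zero counter is not proof of a bad NIC on its own; counters that keep climbing between two runs are the stronger signal.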
+ === Hardware - hdparm is your friend ===
+ Poor disk IO will also result in increased operation latencies. Use hdparm with the -t and
-T options to verify the performance of persistent storage.
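
As a rough sketch of how the hdparm check could be scripted across servers, the example below runs hdparm with -t and -T and extracts the reported throughput. It assumes hdparm is installed, that the script runs with sufficient privileges (typically root), and that the output uses the common "Timing ... = N MB/sec" line format; other hdparm versions may print different units or wording. The names parse_hdparm_timings and run_hdparm are hypothetical.

```python
import re
import subprocess

# Matches lines such as:
#   Timing cached reads:   12000 MB in  2.00 seconds = 6000.00 MB/sec
#   Timing buffered disk reads: 300 MB in  3.00 seconds = 100.00 MB/sec
# This format is an assumption about typical hdparm output.
_TIMING_LINE = re.compile(
    r"Timing (cached reads|buffered disk reads):.*=\s*([\d.]+)\s*MB/sec"
)


def parse_hdparm_timings(output):
    """Return {"cached reads": MB/s, "buffered disk reads": MB/s} as floats."""
    return {m.group(1): float(m.group(2))
            for m in _TIMING_LINE.finditer(output)}


def run_hdparm(device):
    """Run `hdparm -t -T <device>` and return the parsed timings."""
    completed = subprocess.run(
        ["hdparm", "-t", "-T", device],
        capture_output=True, text=True, check=True,
    )
    return parse_hdparm_timings(completed.stdout)
```

For example, run_hdparm("/dev/sda") would return both figures for that device (the device path is only an illustration). Comparing the results across the servers in an ensemble makes a slow disk stand out.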
  === Virtual environments ===
  We've seen situations where users run the entire ZK cluster on a set of VMware VMs, all
on the same host system. Latency on this configuration was well over 10 seconds in some cases
due to resource issues (in particular IO - see the link I provided above; dedicated log devices
are critical to low latency operation of the ZK cluster). Obviously no one should be running
in this configuration in production - in particular there will be no reliability in cases
where the host storage fails!
