Hey all,

We had an issue here at SU where we moved a ZK cluster out from under a set of running clients (HBase master and regionservers), but the clients never handled the situation well, and we had to kill a bunch of processes.

One log had lines like this:

2011-02-03 04:25:21,992 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zookeeper/10.10.21.10:2181
2011-02-03 04:25:33,992 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 13423ms for sessionid 0x42d4ff0a1034fe3, closing socket connection and attempting reconnect
2011-02-03 04:25:34,170 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zookeeper/10.10.21.12:2181
2011-02-03 04:25:46,168 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 12075ms for sessionid 0x42d4ff0a1034fe3, closing socket connection and attempting reconnect
2011-02-03 04:25:47,058 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zookeeper/10.10.21.11:2181
2011-02-03 04:25:59,056 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 12787ms for sessionid 0x42d4ff0a1034fe3, closing socket connection and attempting reconnect

The problem was that we _moved_ the machines and renumbered the IPs, so the OLD IPs were no longer pingable, and judging from these log messages the socket connect just timed out. It might be that timeouts are handled differently than connection refused; I haven't done that digging yet.

The result was that the client never realized that its session had actually timed out, and the HBase processes continued to run. Kill -9 and a restart fixed it.

Thoughts?

-ryan
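
For discussion, here is a minimal sketch of one possible client-side guard. It rests on an assumption that matches the log above: the ZooKeeper client is only told a session has expired once it manages to reconnect to a server, so a client that can never reach the ensemble keeps retrying indefinitely. The class name DisconnectWatchdog, its methods, and the idea of feeding it timestamps are all hypothetical. It is not part of ZooKeeper or HBase, and it deliberately has no ZooKeeper dependency so it can be run standalone.

```java
/**
 * Hypothetical helper (not part of ZooKeeper or HBase): tracks how long a
 * client has been continuously disconnected and reports when that period
 * exceeds the negotiated session timeout, so the application can decide to
 * abort instead of retrying forever.
 */
final class DisconnectWatchdog {
    private final long sessionTimeoutMs;
    // Timestamp of the first disconnect in the current outage; -1 while connected.
    private long disconnectedSinceMs = -1L;

    DisconnectWatchdog(long sessionTimeoutMs) {
        if (sessionTimeoutMs <= 0) {
            throw new IllegalArgumentException("sessionTimeoutMs must be positive");
        }
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Call when the client reports it has lost its connection. */
    synchronized void onDisconnected(long nowMs) {
        // Keep the earliest timestamp so repeated failed reconnects
        // do not reset the clock.
        if (disconnectedSinceMs < 0) {
            disconnectedSinceMs = nowMs;
        }
    }

    /** Call when the client reports it is connected again. */
    synchronized void onConnected(long nowMs) {
        disconnectedSinceMs = -1L;
    }

    /**
     * True if the client has been disconnected for longer than the session
     * timeout, meaning the server side has most likely expired the session
     * even though the client has not been told.
     */
    synchronized boolean presumedExpired(long nowMs) {
        return disconnectedSinceMs >= 0
                && nowMs - disconnectedSinceMs > sessionTimeoutMs;
    }
}
```

In a real client, onDisconnected and onConnected would presumably be driven from the Watcher callback when it sees KeeperState.Disconnected and KeeperState.SyncConnected, with a periodic task polling presumedExpired and shutting the process down cleanly. That wiring is left out here because it depends on how the application manages its ZooKeeper handle.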