zookeeper-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject ZK Client won't time out when quorum irrevocably goes away
Date Thu, 03 Feb 2011 22:57:14 GMT
Hey all,

We had an issue here at SU, where we moved a ZK cluster from under a
series of running clients (HBase master & regionservers)... but the
clients never handled the situation well, and we had to do a bunch of
process kills.

One log had lines like this:

2011-02-03 04:25:21,992 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server zookeeper/10.10.21.10:2181
2011-02-03 04:25:33,992 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 13423ms for sessionid
0x42d4ff0a1034fe3, closing socket connection and attempting reconnect
2011-02-03 04:25:34,170 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server zookeeper/10.10.21.12:2181
2011-02-03 04:25:46,168 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 12075ms for sessionid
0x42d4ff0a1034fe3, closing socket connection and attempting reconnect
2011-02-03 04:25:47,058 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server zookeeper/10.10.21.11:2181
2011-02-03 04:25:59,056 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 12787ms for sessionid
0x42d4ff0a1034fe3, closing socket connection and attempting reconnect

The problem was that we _moved_ the machines and renumbered the IPs,
so the old IPs were no longer pingable, and judging from these log
messages each socket connect simply timed out. It might be that
timeouts are handled differently from connection refused; I haven't
done that digging yet.

The result was that the client never realized that its session had
actually timed out, and the HBase processes continued to run. A kill -9
and a restart fixed it.
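
For what it's worth, here is a rough sketch of the kind of client-side
guard that would have caught this. Session expiry is decided by the
servers, so a client that can never reach any server only ever sees a
Disconnected state and is never told its session expired. The class
below is hypothetical (DisconnectWatchdog and its method names are not
part of ZooKeeper or HBase); it just tracks how long the client has been
continuously disconnected and reports when that exceeds the session
timeout. The clock is passed in so the logic can be exercised without
waiting on real time.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper, not a ZooKeeper API: tracks how long a client has
 * been continuously disconnected so the application can give up on its own.
 */
final class DisconnectWatchdog {
    // Sentinel meaning "currently connected".
    private static final long CONNECTED = Long.MIN_VALUE;

    private final long sessionTimeoutMs;
    private final LongSupplier clockMs;
    private final AtomicLong disconnectedSinceMs = new AtomicLong(CONNECTED);

    DisconnectWatchdog(long sessionTimeoutMs, LongSupplier clockMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.clockMs = clockMs;
    }

    /** Call when the client reports it has lost its connection. */
    void onDisconnected() {
        // Record only the first disconnect; repeated reconnect failures
        // must not reset the start of the outage.
        disconnectedSinceMs.compareAndSet(CONNECTED, clockMs.getAsLong());
    }

    /** Call when the client reports it is connected again. */
    void onConnected() {
        disconnectedSinceMs.set(CONNECTED);
    }

    /** True once the outage has lasted at least the session timeout. */
    boolean sessionPresumedDead() {
        long since = disconnectedSinceMs.get();
        return since != CONNECTED
                && clockMs.getAsLong() - since >= sessionTimeoutMs;
    }
}
```

The assumed wiring would be to call onDisconnected() and onConnected()
from the Watcher passed to the ZooKeeper constructor, on
KeeperState.Disconnected and KeeperState.SyncConnected respectively, and
to poll sessionPresumedDead() from a timer thread that aborts the
process when it returns true.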

Thoughts?
-ryan
