hbase-dev mailing list archives

From Bogdan Ghidireac <ghidir...@gmail.com>
Subject Endless ZK loop
Date Wed, 08 Jun 2011 07:02:52 GMT
Hi all,

We had an interesting problem with HBase and Zookeeper and I would like to
know what your thoughts on this issue are.

I have an HBase client that reads data from a queue and stores it in HBase.
While the client was running, one of my colleagues stopped the ZK fleet (3
hosts), removed the ZK data from zoo.dataDir and restarted it (he wanted a
fresh ZK fleet for a test). After that he restarted the HBase fleet.

The HBase client noticed that the ZK fleet was restarted, but after ZK came
back online the client was not able to reconnect or to close/expire the
session. It was stuck in an endless loop trying to reconnect. I left the
client running for minutes and nothing happened.

Tue Jun 07 12:52:03 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-03.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Opening
socket connection to server pa-zk-na-01.aka.domain.com/10.119.206.58:2181
Tue Jun 07 12:52:03 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-01.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Socket
connection established to pa-zk-na-01.aka.domain.com/10.119.206.58:2181,
initiating session
Tue Jun 07 12:52:03 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-01.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Unable to
read additional data from server sessionid 0x30512689459590, likely server
has closed socket, closing socket connection and attempting reconnect
Tue Jun 07 12:52:03 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-01.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Opening
socket connection to server pa-zk-na-02.aka.domain.com/10.194.180.66:2181
Tue Jun 07 12:52:03 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-02.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Socket
connection established to pa-zk-na-02.aka.domain.com/10.194.180.66:2181,
initiating session
Tue Jun 07 12:52:03 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-02.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Unable to
read additional data from server sessionid 0x30512689459590, likely server
has closed socket, closing socket connection and attempting reconnect
Tue Jun 07 12:52:04 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-02.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Opening
socket connection to server pa-zk-na-03.aka.domain.com/10.254.106.137:2181
Tue Jun 07 12:52:04 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-03.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Socket
connection established to pa-zk-na-03.aka.domain.com/10.254.106.137:2181,
initiating session
Tue Jun 07 12:52:04 2011 GMT Client
29697-0@pa-oi-na-1001.vdc.domain.com:0[INFO] (main-SendThread(
pa-zk-na-03.aka.domain.com:2181)) org.apache.zookeeper.ClientCnxn: Unable to
read additional data from server sessionid 0x30512689459590, likely server
has closed socket, closing socket connection and attempting reconnect


I checked the HBase code (ZooKeeperWatcher.java) and the
connectionEvent(WatchedEvent event) method seems to ignore the Disconnected
event. I do not expect my session to be terminated as soon as a Disconnected
event is received, but I do expect it to be terminated if I cannot reconnect
within some period of time (for example the configured ZK session timeout or
the negotiated timeout).

http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A3

The ZK wiki says that the client has to reconnect to receive the Expired
event, but reconnecting is not always possible. I think the ZK client library
should raise the SessionExpired event itself (or a similar event like
ClientSessionExpired) when the client has been disconnected for more than X
seconds.
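
To make the idea concrete, here is a rough sketch of such a client-side
timeout. It is only an illustration: DisconnectWatchdog and its methods are
hypothetical names, not part of the ZooKeeper or HBase API, and the caller
passes in the current time so the logic does not depend on any clock.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper (not part of ZooKeeper or HBase): remembers when the
 * client first became disconnected and reports a "local expiry" once the
 * disconnect has lasted at least timeoutMillis.
 */
final class DisconnectWatchdog {
    // Sentinel meaning "currently connected".
    private static final long CONNECTED = Long.MIN_VALUE;

    private final long timeoutMillis;
    private final AtomicLong disconnectedSince = new AtomicLong(CONNECTED);

    DisconnectWatchdog(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    /** Call on every Disconnected event; only the first one starts the timer. */
    void onDisconnected(long nowMillis) {
        disconnectedSince.compareAndSet(CONNECTED, nowMillis);
    }

    /** Call when the connection is re-established; resets the timer. */
    void onConnected() {
        disconnectedSince.set(CONNECTED);
    }

    /** True if the client has been disconnected for at least timeoutMillis. */
    boolean isLocallyExpired(long nowMillis) {
        long since = disconnectedSince.get();
        return since != CONNECTED && nowMillis - since >= timeoutMillis;
    }
}
```

An application watcher could call onDisconnected() when it sees
KeeperState.Disconnected, call onConnected() on KeeperState.SyncConnected,
and check isLocallyExpired() from a timer. When it returns true, the
application would close the ZooKeeper handle and create a new one instead of
waiting for an Expired event that may never arrive.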

I assume there are other cases when the client and the quorum are both up
and running but they cannot communicate (a network split for example). I
think both the ZK client library and the quorum should act independently and
expire the session on their side.

Regards,
Bogdan
