zookeeper-user mailing list archives

From Chang Song <tru64...@me.com>
Subject Serious problem processing heartbeat on login stampede
Date Wed, 13 Apr 2011 13:35:29 GMT
Hello, folks.

We have run into a very serious issue with Zookeeper.
Here's a brief scenario.

We have some Zookeeper clients with a session timeout of 15 sec (thus a 5 sec ping). Let's call
these clients group A.

Now 1000 new clients (let's call these group B) start up at the same time, trying to
connect to a three-node ZK ensemble and creating a ZK createSession stampede.

Now almost all clients in group A are unable to exchange a ping within the session expiry time
(15 sec).
Thus the clients in group A drop out of the cluster.

We have looked into this issue a bit and found that session queue processing is mostly synchronous.
Latency between ping request and response ranges from 10 ms up to 14 seconds during this login
stampede.

Since a session timeout is a serious matter for our cluster, pings should be handled in a
pseudo-realtime fashion.

I don't know exactly how the ping timeout policy works in the clients and the server, but clients
failing to receive ping responses because of Zookeeper session logins makes no sense to me.

Shouldn't we have a separate ping/heartbeat queue and thread?
Or even multiple ping queues/threads to keep realtime heartbeat?
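
To illustrate the idea, here is a toy model in Python (not Zookeeper's actual code) of a single FIFO request queue versus a dedicated ping queue. The function name and the per-request costs are hypothetical; the 14 ms createSession cost is an assumption picked only so that a backlog of 1000 logins roughly matches the 14 second worst case we observed.

```python
from collections import deque

def ping_latency_ms(pending_creates, create_cost_ms, ping_cost_ms, separate_ping_queue):
    """Return how long one ping waits for its response, in milliseconds."""
    if separate_ping_queue:
        # Pings get their own queue and worker, so a ping never waits
        # behind createSession requests.
        queue = deque([("ping", ping_cost_ms)])
    else:
        # One shared FIFO queue: the ping lands behind every pending login.
        queue = deque([("createSession", create_cost_ms)] * pending_creates)
        queue.append(("ping", ping_cost_ms))

    clock_ms = 0
    while queue:
        kind, cost = queue.popleft()
        clock_ms += cost  # requests are processed one at a time
        if kind == "ping":
            return clock_ms
    raise RuntimeError("no ping in queue")

# Hypothetical costs: 14 ms per createSession, 1 ms per ping.
shared = ping_latency_ms(1000, 14, 1, separate_ping_queue=False)
dedicated = ping_latency_ms(1000, 14, 1, separate_ping_queue=True)
print("shared queue ping latency (ms):", shared)
print("dedicated queue ping latency (ms):", dedicated)
```

In this model the ping latency on the shared queue grows linearly with the number of queued logins, while on the dedicated queue it stays at the cost of the ping itself.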

This is a very serious issue with Zookeeper for our mission-critical system. Could anyone
look into this?

I will try to file a bug.

Thank you.

