zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Serious problem processing hearbeat on login stampede
Date Wed, 13 Apr 2011 15:21:34 GMT
Hi Chang, it sounds like you may have an issue with your cluster
environment/setup, or perhaps a resource (GC/mem) issue. Have you
looked through the troubleshooting guide?
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting

In particular 1000 clients connecting should be fine, I've personally
seen clusters of 7-10 thousand clients. Keep in mind that each session
establishment is essentially a write (so the quorum is involved) and
what we typically see there is that the cluster configuration has
issues. 14 seconds for a ping response is huge and indicates one of
the following may be an underlying cause:

1) are you running in a virtualized environment?
2) are you co-locating other services on the same host(s) that make up
the ZK serving cluster?
3) have you followed the admin guide's "things to avoid"?
http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_commonProblems
In particular ensuring that you are not swapping or going into gc
pause (both on the server and the client)
a) try turning on GC logging and ensure that you are not going into GC
pause (see the troubleshooting guide); this is the most common cause of
high latency for the clients
b) ensure that you are not swapping
c) ensure that other processes are not causing log writing
(transactional logging) to be slow.
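
As a rough complement to GC logging in (a), a process can also report
its own accumulated collection time through the standard JMX beans.
This is only a sketch: the class name GcPauseCheck and the 5 second
sampling window are illustrative assumptions, not part of ZooKeeper,
and with concurrent collectors the reported collection time is not
the same as stop-the-world pause time, so treat a large value as a
hint to go read the GC log.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper (not part of ZooKeeper): samples JVM-wide GC time.
public class GcPauseCheck {

    // Sum of the approximate accumulated collection time, in milliseconds,
    // across all collectors of this JVM.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        Thread.sleep(5000); // assumed window, roughly one ping interval for a 15 sec session
        long delta = totalGcMillis() - before;
        System.out.println("GC time during the last 5 sec: " + delta + " ms");
    }
}
```

If the value printed for a single window approaches the ping interval,
GC is a plausible suspect and the GC log should confirm or rule it out.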

Patrick

On Wed, Apr 13, 2011 at 6:35 AM, Chang Song <tru64ufs@me.com> wrote:
> Hello, folks.
>
> We have run into a very serious issue with Zookeeper.
> Here's a brief scenario.
>
> We have some Zookeeper clients with a session timeout of 15 sec (thus 5 sec ping); let's call
> these clients group A.
>
> Now 1000 new clients (let's call these group B) start up at the same time, trying to
> connect to a three-node ZK ensemble and creating a ZK createSession stampede.
>
> Now almost all clients in group A are unable to exchange pings within the session expiry
> time (15 sec).
> Thus clients in group A drop out of the cluster.
>
> We have looked into this issue a bit and found that session queue processing is mostly
> synchronous.
> Latency between ping request and response ranges from 10ms up to 14 seconds during this
> login stampede.
>
> Since a session timeout is a serious matter for our cluster, pings should be handled in a
> pseudo-realtime fashion.
>
> I don't know exactly how the ping timeout policy works in the clients and server, but clients
> failing to receive a ping response because of zookeeper login sessions makes no sense to me.
>
> Shouldn't we have a separate ping/heartbeat queue and thread?
> Or even multiple ping queues/threads to keep realtime heartbeat?
>
> This is a very serious issue with Zookeeper for our mission-critical system. Could anyone
> look into this?
>
> I will try to file a bug.
>
> Thank you.
>
> Chang
