zookeeper-user mailing list archives

From "Fournier, Camille F. [Tech]" <Camille.Fourn...@gs.com>
Subject RE: Serious problem processing hearbeat on login stampede
Date Mon, 18 Apr 2011 14:49:12 GMT
Is it possible this is related to this report back in February?

I theorized that the issue might be due to synchronization on the session table, but never
got enough information to finish the investigation. 


-----Original Message-----
From: Chang Song [mailto:tru64ufs@me.com] 
Sent: Saturday, April 16, 2011 8:31 AM
To: user@zookeeper.apache.org
Cc: zookeeper-user@hadoop.apache.org
Subject: Re: Serious problem processing hearbeat on login stampede


That's exactly the same symptom (queueing in CommitRequestProcessor).
We didn't bypass pings entirely; instead we pushed ping requests from the beginning
of the queue directly to FinalRequestProcessor(), but that didn't alleviate the problem.

We will post a more detailed analysis in the ZK JIRA bug soon.

Thank you.


P.S. We are also working on a simple reproducer so that a committer can
      reproduce and fix the problem.

On 2011-04-16 at 8:36, Lakshman wrote:

> Hi Everyone,
> We also faced a similar [session timeout] issue, but in a different scenario.
> Here is some analysis I did some time back. The same was posted on the
> zookeeper-user forum.
> There is no under-provisioning on the server side.
> The issue was resolved after letting ping requests bypass the queue. This may
> not be a good idea, but we just gave it a try.
> Here is the earlier mail I posted on the forum.
> *********************************
> Subject: Frequent SessionTimeoutException[Client] -
> CancelledKeyException[Server]
> We are using ZooKeeper 3.3.1, and we frequently hit
> CancelledKeyException after application startup.
> The average response time is less than 50 milliseconds, but the last request
> sent gets no response for 20 seconds, so it times out.
> On analysis, we found a possible problem with CommitRequestProcessor.
> The following series of steps occurs:
> 1. The client sends some request [exists, setData, etc.].
> 2. The server receives the packet completely and submits it for
>    processing [nextPending].
> 3. The client sends some ping requests after that.
> 4. The server receives the ping requests as well, and they are also queued up.
> 5. The client times out because it gets no response from the server.
> This is because ping requests are also queued in queuedRequests, which
> waits for a committed request for the current nextPending operation.
> As per my understanding, ping requests from a client need not be queued and
> can be processed immediately.
> *********************************
> --
> Thanks
> Laxman
> -----Original Message-----
> From: Chang Song [mailto:tru64ufs@me.com] 
> Sent: Wednesday, April 13, 2011 7:05 PM
> To: zookeeper-user@hadoop.apache.org
> Subject: Serious problem processing hearbeat on login stampede
> Hello, folks.
> We have run into a very serious issue with ZooKeeper.
> Here's a brief scenario.
> We have some ZooKeeper clients with a session timeout of 15 sec (thus a 5 sec
> ping interval); let's call these clients group A.
> Now 1000 new clients (let's call these group B) start up at the same time,
> trying to connect to a three-node ZK ensemble, creating a ZK createSession
> stampede.
> Now almost all clients in group A are unable to exchange pings within the
> session expiry time (15 sec).
> Thus the clients in group A drop out of the cluster.
> We have looked into this issue a bit and found that session queue processing
> is mostly synchronous.
> Latency between ping request and response ranges from 10 ms up to 14 seconds
> during this login stampede.
> Since a session timeout is a serious matter for our cluster, pings should be
> handled in a pseudo-realtime fashion.
> I don't know exactly how the ping timeout policy works in the clients and server,
> but clients failing to receive ping responses because of ZooKeeper login
> sessions makes no sense to me.
> Shouldn't we have a separate ping/heartbeat queue and thread?
> Or even multiple ping queues/threads to keep heartbeats realtime?
> This is a very serious issue with ZooKeeper for our mission-critical system.
> Could anyone look into this?
> I will try to file a bug.
> Thank you.
> Chang
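
The queueing behaviour described in this thread (pings stuck behind a request
that is still waiting for its commit) can be illustrated with a toy model. The
sketch below is not ZooKeeper code: the function simulate(), its parameters,
and the request names are hypothetical, and it assumes a single FIFO queue in
which a pending write blocks everything queued behind it until its commit
arrives. It only shows why a ping can see many seconds of latency in such a
queue, and what answering pings on arrival would change in that simplified
model. Note that in this thread one poster reports that bypassing the queue
resolved the timeouts, while another reports that a similar change did not
help, so the real behaviour is evidently more involved than this model.

```python
from collections import deque


def simulate(requests, commit_time, bypass_pings):
    """Return {request_id: response_time} for a toy single-queue processor.

    requests:     list of (arrival_time, request_id, kind) tuples, where
                  kind is "write" or "ping" (hypothetical names).
    commit_time:  dict mapping each write's request_id to the time at which
                  its commit arrives (an assumption of this model).
    bypass_pings: if True, pings are answered on arrival, not queued.
    """
    responses = {}
    queue = deque()

    for arrival, req_id, kind in sorted(requests):
        if kind == "ping" and bypass_pings:
            # Ping is answered immediately, without entering the queue.
            responses[req_id] = arrival
        else:
            queue.append((arrival, req_id, kind))

    clock = 0.0
    while queue:
        arrival, req_id, kind = queue.popleft()
        clock = max(clock, arrival)
        if kind == "write":
            # The head of the queue blocks until its commit arrives.
            clock = max(clock, commit_time[req_id])
        responses[req_id] = clock

    return responses


if __name__ == "__main__":
    reqs = [
        (0.0, "setData-1", "write"),
        (5.0, "ping-1", "ping"),
        (10.0, "ping-2", "ping"),
    ]
    commits = {"setData-1": 20.0}

    for bypass in (False, True):
        result = simulate(reqs, commits, bypass_pings=bypass)
        for arrival, req_id, _ in reqs:
            latency = result[req_id] - arrival
            print(f"bypass={bypass} {req_id}: latency {latency:.1f}s")
```

In the queued run both pings wait for the write's commit, so a ping sent at
t=5 sees 15 seconds of latency, which is already a full session timeout for
the 15-second sessions described above. With the bypass, ping latency in this
model is zero.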
