zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Ping and client session timeouts
Date Fri, 21 May 2010 20:54:08 GMT

On 05/21/2010 11:32 AM, Stephen Green wrote:
> Right.  The system can be very memory-intensive, but at the time these
> are occurring, it's not under a really heavy load, and there's plenty
> of heap available. However, while looking at a thread dump from one of
> the nodes, I realized that a very poor decision meant that I had more
> than 1200 threads running.  I expect this is more of a problem than
> the GC at this point.  I'm taking steps to correct this problem now.
> Lately, I've had fewer and fewer problems with GC.  In a former life,
> I sat down the hall from the folks who wrote Hotspot's GC and they're
> pretty sharp folks :-)

GC is a very common cause; however, had you mentioned the 1200 threads, I
would have guessed that to be a potential issue. ;-)

> Right.  I'd like to have as small a timeout as possible so that I
> notice quickly when things disappear.  What's a reasonable minimum?  I
> notice recommendations in other messages on the list that 20000 is a
> good value.

The setting you should use is typically determined by your SLA
requirements. How soon do you want ephemeral nodes to be cleaned up if a
client fails? Say you were doing leader election: the timeout would gate
re-election in the case where the current leader failed. Set it lower
and you are more responsive (faster), but also more susceptible to
"false positives" (such as a temporary network glitch). Set it higher
and you ride over the network glitches, but it takes longer to recover
when a client really does go down.
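
To make the tradeoff concrete, here is a minimal sketch under a deliberately simplified model: a session counts as lost whenever the client is silent for longer than the session timeout. It ignores heartbeat scheduling and server tick granularity, and the class name, method name, and pause values are all hypothetical, chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model: a session is lost when the client is silent longer than the timeout. */
class TimeoutTradeoffSketch {

    /** Counts the pauses (in ms) that would outlast the given session timeout (in ms). */
    static long falseExpirations(List<Long> pausesMs, long sessionTimeoutMs) {
        return pausesMs.stream().filter(p -> p > sessionTimeoutMs).count();
    }

    public static void main(String[] args) {
        // Hypothetical pauses seen by a client that never actually died:
        // two short blips and one 5 second network glitch.
        List<Long> pausesMs = Arrays.asList(200L, 1_500L, 5_000L);

        for (long timeoutMs : new long[] {4_000L, 20_000L}) {
            // In this model a real failure is noticed after roughly one timeout,
            // so the timeout also bounds how quickly re-election can start.
            System.out.println("timeout=" + timeoutMs + " ms"
                    + ", false expirations=" + falseExpirations(pausesMs, timeoutMs)
                    + ", approx. failure detection delay=" + timeoutMs + " ms");
        }
    }
}
```

With these made-up numbers, the 4000 ms timeout expires the session once on a client that was merely slow, while the 20000 ms timeout rides over every pause but delays detection of a real failure by roughly five times as long.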

In some cases (hbase, solr) we've seen that the timeout had to be set
artificially high due to the limitations of the current JVM GC
algorithms. For example, some hbase users were seeing GC pause times of
more than 4 minutes. So this raises the question: do you consider this a
failure or not? (I could reboot the machine faster than it takes to run
that GC...)
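
One practical wrinkle when setting the timeout very high: the value the client asks for is only a request. The server clamps it to a range that defaults to 2 to 20 times its tickTime (configurable as minSessionTimeout and maxSessionTimeout from 3.3.0 on). The sketch below only reproduces that clamping arithmetic. It is not the server's code, the class and method names are hypothetical, and the tickTime of 2000 ms is an assumed configuration value.

```java
/** Illustrative arithmetic only: how a requested session timeout is bounded by the server. */
class NegotiatedTimeoutSketch {

    /** Clamps the requested timeout (ms) into the inclusive range [minMs, maxMs]. */
    static int negotiate(int requestedMs, int minMs, int maxMs) {
        return Math.max(minMs, Math.min(requestedMs, maxMs));
    }

    /** Applies the default bounds of 2 * tickTime and 20 * tickTime. */
    static int negotiateWithDefaults(int requestedMs, int tickTimeMs) {
        return negotiate(requestedMs, 2 * tickTimeMs, 20 * tickTimeMs);
    }

    public static void main(String[] args) {
        int tickTimeMs = 2_000; // assumed server setting

        // Asking for 5 minutes to ride over a 4 minute GC pause is capped by default.
        System.out.println(negotiateWithDefaults(300_000, tickTimeMs)); // prints 40000

        // Raising the upper bound on the server lets the large request through.
        System.out.println(negotiate(300_000, 2 * tickTimeMs, 300_000)); // prints 300000

        // A very small request is raised to the lower bound.
        System.out.println(negotiateWithDefaults(1_000, tickTimeMs)); // prints 4000
    }
}
```

So a timeout long enough to cover a multi-minute GC pause also needs the server-side upper bound raised, and every real failure then takes that long to detect.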

Good luck,

