zookeeper-user mailing list archives

From Shawn Heisey <apa...@elyograg.org>
Subject Re: Zookeeper session expiration
Date Mon, 04 Dec 2017 19:49:26 GMT
On 12/4/2017 8:22 AM, Anthony Shaya wrote:
> My question is related to how session expiration works, I noticed on many of the client
> machines the times across these machines were all off (by anywhere from 1 minute to 20 minutes
> - which was resolved after discovery - haven't verified this completely yet). Can this directly
> affect session expiration within the zookeeper cluster?
> 
>    *   I read the following in https://wiki.apache.org/hadoop/ZooKeeper/FAQ , "Expirations
> happens when the cluster does not hear from the client within the specified session timeout
> period (i.e. no heartbeat).". So in some cases it seems like if the times were wrong across
> the machines it's possible one of the clients could have effectively sent a heartbeat in the
> past (not sure about this tbh) and then the cluster expires the session?

I make these comments without any knowledge of what the ZK code actually 
does.  I am a member of this list because I'm a representative of the 
Apache Solr project, which uses the ZK client in order to maintain a 
cluster.

IMHO, any software which makes actual decisions based on the timestamps 
in messages from another system is badly designed.  I would hope that 
the ZK designers know this, and always make any decisions related to 
time using the clock in the local system only.

If ZK's designers did the right thing, then a session timeout would 
indicate that quite literally no heartbeats were received in X seconds, 
as measured by the local clock, and the local clock ONLY ... NOT from 
timestamp information received from another system.
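
To make that concrete, here is a minimal sketch of "local clock only" 
expiration.  This is NOT ZooKeeper's actual code; the class and method 
names are made up for illustration.  The receiver stamps each heartbeat 
with its own monotonic clock (System.nanoTime()) and never reads any 
timestamp the client might have sent, so wall-clock skew on the client 
machines cannot affect the outcome.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: expires sessions using only the receiver's own clock. */
class LocalClockSessionTracker {
    private final long timeoutNanos;
    // sessionId -> local monotonic time (nanos) at which the last heartbeat arrived
    private final Map<Long, Long> lastHeard = new ConcurrentHashMap<>();

    LocalClockSessionTracker(long timeoutMillis) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
    }

    /** Record a heartbeat; any timestamp carried in the message is ignored. */
    void onHeartbeat(long sessionId) {
        onHeartbeat(sessionId, System.nanoTime());
    }

    /** Same, with the local time passed in so the logic can be checked without sleeping. */
    void onHeartbeat(long sessionId, long localNowNanos) {
        lastHeard.put(sessionId, localNowNanos);
    }

    /** Remove and return every session not heard from within the timeout. */
    List<Long> expire(long localNowNanos) {
        List<Long> expired = new ArrayList<>();
        for (Map.Entry<Long, Long> e : lastHeard.entrySet()) {
            // Compare by subtraction, which is the overflow-safe way to use nanoTime values.
            if (localNowNanos - e.getValue() > timeoutNanos) {
                expired.add(e.getKey());
            }
        }
        lastHeard.keySet().removeAll(expired);
        return expired;
    }
}
```

In real use you would call expire(System.nanoTime()) periodically.  With 
a design like this, a client whose clock is 20 minutes off looks exactly 
the same as one whose clock is correct.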

Although such a lack of communication could be caused by any number of 
things, including network hardware failure, one of the most common 
reasons I have seen for problems like this is extreme Java garbage 
collection pauses in the client software.

Situations where the heap is a little bit too small can cause a Java 
program to basically be doing garbage collection constantly, so it 
doesn't have much time to do anything else, like send heartbeats to ZK 
servers.

Situations where the heap is HUGE and garbage collection is not well 
tuned can lead to pauses of a minute or longer while Java does a massive 
full GC.
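
One rough way to see whether a client process is stalling like this is 
to have it sleep for a short, fixed interval and measure how late it 
wakes up.  The sketch below is only an illustration (PauseDetector and 
its methods are hypothetical names, not part of ZK or Solr), and an 
overshoot can also come from CPU starvation or swapping rather than GC, 
so treat the numbers as a hint and not a diagnosis.

```java
/** Hypothetical helper: detects process-wide stalls such as long GC pauses. */
class PauseDetector {
    /** How many ms longer than expected an interval took; never negative. */
    static long overshootMillis(long expectedMillis, long elapsedNanos) {
        return Math.max(0L, elapsedNanos / 1_000_000L - expectedMillis);
    }

    /** Sleeps repeatedly for intervalMillis and returns the worst overshoot seen. */
    static long worstStallMillis(long intervalMillis, int iterations) {
        long worst = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop early
                break;
            }
            long elapsed = System.nanoTime() - start;
            worst = Math.max(worst, overshootMillis(intervalMillis, elapsed));
        }
        return worst;
    }
}
```

If the worst overshoot gets anywhere near the session timeout, the 
client could not have been sending heartbeats during that stretch either.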

>    *   I don't have the zookeeper node log for the above time to see what was going on
> in zookeeper when the cluster determined the session expired.
> 
>    *   Is there any additional logging I can turn on to troubleshoot zk session expiration
> issues?

Hopefully your ZK clients also have logging.  Failing that, you could 
turn on GC logging for the software with the ZK client (assuming it's a 
Java client) and find a program or website that can examine the log and 
give you statistics or a graph of GC pauses.
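
For reference, GC logging is normally enabled with JVM options: on Java 
9 and later something like -Xlog:gc*:file=gc.log, and on Java 8 
something like -XX:+PrintGCDetails -Xloggc:gc.log (the file name is just 
an example).  If you can change the client code, the JVM's standard 
management beans also give a quick in-process summary.  The sketch below 
uses a hypothetical GcStats class; note that the reported time is 
accumulated collection time, which for concurrent collectors is not the 
same thing as pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: summarizes what the JVM reports about its collectors. */
class GcStats {
    /** Total ms the JVM reports having spent collecting since it started. */
    static long totalGcMillis() {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** One line per collector: name, collection count, accumulated time in ms. */
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Logging GcStats.summary() every minute or so and watching how fast the 
numbers grow will show whether the process is spending most of its time 
in GC, though it will not show the length of individual pauses the way 
a GC log does.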

If there is a problem in software using the client and whatever logging 
is available doesn't help you figure out what's wrong, you're generally 
going to need to talk to whoever wrote that software for help 
troubleshooting it.

Thanks,
Shawn
