zookeeper-user mailing list archives

From Anthony Shaya <ash...@workforcesoftware.com>
Subject RE: Zookeeper session expiration
Date Mon, 04 Dec 2017 19:51:32 GMT
Thanks, Shawn. Should I message the developer mailing list for a more definitive answer?

Thanks again for the reply.

-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org] 
Sent: Monday, December 4, 2017 2:49 PM
To: user@zookeeper.apache.org
Subject: Re: Zookeeper session expiration

On 12/4/2017 8:22 AM, Anthony Shaya wrote:
> My question is about how session expiration works. I noticed that the clocks on many of
> the client machines were off (by anywhere from 1 minute to 20 minutes; this was resolved
> after discovery, though I haven't verified that completely yet). Can this directly affect
> session expiration within the ZooKeeper cluster?
>    *   I read the following in https://wiki.apache.org/hadoop/ZooKeeper/FAQ:
> "Expirations happens when the cluster does not hear from the client within the specified
> session timeout period (i.e. no heartbeat)." So if the clocks were wrong across the
> machines, it seems possible that one of the clients could have effectively sent a
> heartbeat in the past (not sure about this, to be honest) and the cluster then expired
> the session?

I make these comments without any knowledge of what ZK code actually does.  I am a member
of this list because I'm a representative of the Apache Solr project, which uses the ZK client
in order to maintain a cluster.

IMHO, any software which makes actual decisions based on the timestamps in messages from another
system is badly designed.  I would hope that the ZK designers know this, and always make any
decisions related to time using the clock in the local system only.

If ZK's designers did the right thing, then a session timeout would indicate that quite literally
no heartbeats were received in X seconds, as measured by the local clock, and the local clock
ONLY ... NOT from timestamp information received from another system.
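
To illustrate the idea above, here is a minimal Java sketch of a session tracker that decides expiration purely from the receiver's own clock. It is not taken from ZooKeeper's source; the class name LocalClockSessionTracker and its methods are hypothetical, and the clock is injected so that a monotonic source such as System::nanoTime can be used in practice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch; NOT ZooKeeper's actual session tracking code.
final class LocalClockSessionTracker {
    private final long timeoutNanos;
    private final LongSupplier clock; // local monotonic clock, e.g. System::nanoTime
    private final Map<Long, Long> lastHeard = new HashMap<>();

    LocalClockSessionTracker(long timeoutMillis, LongSupplier clock) {
        this.timeoutNanos = timeoutMillis * 1_000_000L;
        this.clock = clock;
    }

    // Record a heartbeat. Any timestamp the client may have sent is ignored;
    // only the receiver's own clock is read.
    void heartbeat(long sessionId) {
        lastHeard.put(sessionId, clock.getAsLong());
    }

    // A session counts as expired if it is unknown, or if nothing has been
    // heard from it for longer than the timeout, measured locally.
    boolean isExpired(long sessionId) {
        Long last = lastHeard.get(sessionId);
        return last == null || clock.getAsLong() - last > timeoutNanos;
    }
}
```

With a design like this, a client whose wall clock is 20 minutes off would be treated the same as any other client, because its clock never enters the calculation.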

Although such a lack of communication could be caused by any number of things, including network
hardware failure, one of the most common reasons I have seen for problems like this is extreme
Java garbage collection pauses in the client software.

Situations where the heap is a little bit too small can cause a Java program to basically
be doing garbage collection constantly, so it doesn't have much time to do anything else,
like send heartbeats to ZK servers.

Situations where the heap is HUGE and garbage collection is not well tuned can lead to pauses
of a minute or longer while Java does a massive full GC.

>    *   I don't have the ZooKeeper node log for the above time to see what was going on
> in ZooKeeper when the cluster determined the session expired.
>    *   Is there any additional logging I can turn on to troubleshoot ZK session expiration?

Hopefully your ZK clients also have logging.  Failing that, you could turn on GC logging for
the software with the ZK client (assuming it's a Java client) and find a program or website
that can examine the log and give you statistics or a graph of GC pauses.
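
As a rough illustration of the GC suggestion above: GC logging is normally enabled with JVM options (for example -verbose:gc, or -Xloggc:<file> on Java 8 and -Xlog:gc* on Java 9 and later). If changing startup options is awkward, the sketch below reads cumulative GC statistics from inside the client process through the standard java.lang.management API. The class name GcStats and its methods are hypothetical. The figures are approximate accumulated collection times, not individual pause lengths, so they only hint at whether GC is a plausible cause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; not part of the ZooKeeper client.
final class GcStats {
    // Approximate total milliseconds spent in GC so far, summed over all collectors.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    // One line per collector: name, collection count, accumulated time.
    static String summary() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(": count=").append(gc.getCollectionCount())
              .append(", timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }
}
```

Logging GcStats.summary() periodically and comparing successive values against the session timeout gives a first indication of whether collections are taking long enough to starve the heartbeats.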

If there is a problem in software using the client and whatever logging is available doesn't
help you figure out what's wrong, you're generally going to need to talk to whoever wrote
that software for help troubleshooting it.


