From: Patrick Hunt
Date: Wed, 6 Dec 2017 15:17:59 -0800
Subject: Re: Zookeeper session expiration
To: UserZooKeeper

What Jordan said, plus: time is only used in the relative sense, not the
absolute. Session tracking (expiration) is relative to the start of
leadership.

Patrick

On Mon, Dec 4, 2017 at 12:21 PM, Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:

> ZooKeeper, indeed, does not use wall clock time. It uses System.nanoTime()
> for most operations. Further, all operations go through the Leader node so
> only the Leader's notion of time matters. The Leader manages the session
> via a "SessionTracker" instance. The code is in SessionTrackerImpl.java.
> There is a sessionExpiryQueue which is a kind of priority queue that
> returns expired sessions based on System.nanoTime().
>
> -JZ
>
>
> On Dec 4, 2017, at 12:09 PM, Abraham Fine wrote:
> >
> > Hello Anthony and Shawn-
> >
> > To the best of my knowledge ZooKeeper does not use the "wall clock" time
> > anywhere. So that should not be the problem.
> >
> > Please consider enabling debug logging, which should allow you to track
> > the "pings".
> >
> > Thanks,
> > Abe
> >
> > On Mon, Dec 4, 2017, at 11:51, Anthony Shaya wrote:
> >> Thanks, Shawn. Should I message the developer mailing list for a more
> >> definitive answer?
> >>
> >> Thanks again for the reply.
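Jordan's description of the leader-side bookkeeping, together with Patrick's note that expiration is measured relative to the start of leadership, can be sketched in Java as follows. This is a minimal, hypothetical illustration: the class and method names (`SketchSessionTracker`, `open`, `touch`, `onBecomeLeader`, `expire`) are invented here, a plain heap ordered by deadline stands in for the real sessionExpiryQueue, and the actual SessionTrackerImpl.java differs in detail. What it shows is that every decision compares readings of the leader's own monotonic clock; no timestamp from a client is ever consulted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

// Hypothetical, heavily simplified stand-in for a leader-side session tracker.
// NOT ZooKeeper's SessionTrackerImpl: it only shows expiry being decided from
// the leader's own monotonic clock (System.nanoTime() by default).
final class SketchSessionTracker {
    private static final class Deadline {
        final long sessionId;
        final long nanos;
        Deadline(long sessionId, long nanos) {
            this.sessionId = sessionId;
            this.nanos = nanos;
        }
    }

    private final LongSupplier clock;                       // monotonic nanoseconds
    private final Map<Long, Long> timeoutNanos = new HashMap<>();
    private final Map<Long, Long> currentDeadline = new HashMap<>();
    // Earliest deadline first; compare by difference so nanoTime wraparound is safe.
    private final PriorityQueue<Deadline> queue =
        new PriorityQueue<>((a, b) -> Long.signum(a.nanos - b.nanos));

    SketchSessionTracker(LongSupplier clock) { this.clock = clock; }
    SketchSessionTracker() { this(System::nanoTime); }

    void open(long sessionId, long timeoutMillis) {
        timeoutNanos.put(sessionId, TimeUnit.MILLISECONDS.toNanos(timeoutMillis));
        touch(sessionId);
    }

    // A request or ping pushes the deadline out by one timeout, measured from when
    // the leader received it. The client's clock is never read.
    void touch(long sessionId) {
        Long t = timeoutNanos.get(sessionId);
        if (t == null) return;                              // unknown or already expired
        schedule(sessionId, clock.getAsLong() + t);
    }

    // Patrick's note: on becoming leader, every known session gets a full
    // timeout counted from now, not from anything before leadership began.
    void onBecomeLeader() {
        queue.clear();
        long now = clock.getAsLong();
        for (Map.Entry<Long, Long> e : timeoutNanos.entrySet()) {
            schedule(e.getKey(), now + e.getValue());
        }
    }

    // Removes and returns the ids of all sessions whose deadline has passed.
    List<Long> expire() {
        long now = clock.getAsLong();
        List<Long> expired = new ArrayList<>();
        while (!queue.isEmpty() && now - queue.peek().nanos >= 0) {
            Deadline d = queue.poll();
            Long current = currentDeadline.get(d.sessionId);
            if (current == null || current != d.nanos) continue;  // stale heap entry
            currentDeadline.remove(d.sessionId);
            timeoutNanos.remove(d.sessionId);
            expired.add(d.sessionId);
        }
        return expired;
    }

    private void schedule(long sessionId, long deadlineNanos) {
        currentDeadline.put(sessionId, deadlineNanos);
        queue.add(new Deadline(sessionId, deadlineNanos));
    }
}
```

Injecting the clock as a `LongSupplier` is only a convenience for testing; the default constructor reads `System.nanoTime()`. Stale heap entries left behind by later `touch` calls are skipped lazily rather than removed, which keeps each heartbeat a single O(log n) insert.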
> >>
> >> -----Original Message-----
> >> From: Shawn Heisey [mailto:apache@elyograg.org]
> >> Sent: Monday, December 4, 2017 2:49 PM
> >> To: user@zookeeper.apache.org
> >> Subject: Re: Zookeeper session expiration
> >>
> >> On 12/4/2017 8:22 AM, Anthony Shaya wrote:
> >>> My question is related to how session expiration works. I noticed that
> >>> the clocks on many of the client machines were off (by anywhere from
> >>> 1 minute to 20 minutes; this was resolved after discovery, though I
> >>> haven't verified that completely yet). Can this directly affect session
> >>> expiration within the ZooKeeper cluster?
> >>>
> >>> * I read the following in https://wiki.apache.org/hadoop/ZooKeeper/FAQ :
> >>> "Expirations happens when the cluster does not hear from the client
> >>> within the specified session timeout period (i.e. no heartbeat).". So
> >>> in some cases it seems that, if the clocks were wrong across the
> >>> machines, it's possible one of the clients could have effectively sent
> >>> a heartbeat in the past (not sure about this tbh) and then the cluster
> >>> expires the session?
> >>
> >> I make these comments without any knowledge of what ZK code actually
> >> does. I am a member of this list because I'm a representative of the
> >> Apache Solr project, which uses the ZK client in order to maintain a
> >> cluster.
> >>
> >> IMHO, any software which makes actual decisions based on the timestamps
> >> in messages from another system is badly designed. I would hope that the
> >> ZK designers know this, and always make any decisions related to time
> >> using the clock in the local system only.
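Anthony's worry and Shawn's design point can be made concrete with a little arithmetic. The hypothetical helpers below (names invented for illustration; this is not ZooKeeper code, which per Abe and Jordan does not use wall-clock time at all) contrast a design that trusts the sender's wall-clock timestamp with one that only measures elapsed time on the receiver's monotonic clock. With a client clock 20 minutes slow and an illustrative 30-second timeout, the first design would judge a heartbeat that just arrived to be 1,200,000 ms old.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helpers contrasting two ways a server *could* judge whether a
// heartbeat is too old. Only the second is skew-proof.
final class SkewDemo {
    // Flawed: compares the sender's wall-clock stamp with the receiver's wall
    // clock, so any skew between the two machines leaks into the decision.
    static boolean staleBySenderTimestamp(long senderWallMillis, long receiverWallMillis,
                                          long timeoutMillis) {
        return receiverWallMillis - senderWallMillis > timeoutMillis;
    }

    // Robust: compares two readings of the receiver's own monotonic clock
    // (e.g. System.nanoTime()), so the sender's clock never matters.
    static boolean staleByLocalClock(long lastReceivedNanos, long nowNanos,
                                     long timeoutMillis) {
        return nowNanos - lastReceivedNanos > TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    public static void main(String[] args) {
        long timeoutMillis = 30_000;                                     // illustrative value
        long receiverWall = System.currentTimeMillis();
        long senderWall = receiverWall - TimeUnit.MINUTES.toMillis(20);  // sender 20 min slow

        // A heartbeat sent a moment ago looks 20 minutes old to the flawed check:
        System.out.println(staleBySenderTimestamp(senderWall, receiverWall, timeoutMillis)); // prints true
        // The local-clock check only sees the 1 ms since the heartbeat arrived:
        long received = System.nanoTime();
        System.out.println(staleByLocalClock(received, received + 1_000_000L, timeoutMillis)); // prints false
    }
}
```

Skew in the other direction also breaks the flawed design: a sender whose clock runs fast makes its heartbeats look newer than they are, so a dead client would be expired late by roughly the amount of skew.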
> >>
> >> If ZK's designers did the right thing, then a session timeout would
> >> indicate that quite literally no heartbeats were received in X seconds,
> >> as measured by the local clock, and the local clock ONLY ... NOT from
> >> timestamp information received from another system.
> >>
> >> Although such a lack of communication could be caused by any number of
> >> things, including network hardware failure, one of the most common
> >> reasons I have seen for problems like this is extreme Java garbage
> >> collection pauses in the client software.
> >>
> >> Situations where the heap is a little bit too small can cause a Java
> >> program to be doing garbage collection almost constantly, so it
> >> doesn't have much time to do anything else, like send heartbeats to ZK
> >> servers.
> >>
> >> Situations where the heap is HUGE and garbage collection is not well
> >> tuned can lead to pauses of a minute or longer while Java does a massive
> >> full GC.
> >>
> >>> * I don't have the ZooKeeper node log for the above time to see
> >>> what was going on in ZooKeeper when the cluster determined the session
> >>> expired.
> >>>
> >>> * Is there any additional logging I can turn on to troubleshoot ZK
> >>> session expiration issues?
> >>
> >> Hopefully your ZK clients also have logging. Failing that, you could
> >> turn on GC logging for the software with the ZK client (assuming it's a
> >> Java client) and find a program or website that can examine the log and
> >> give you statistics or a graph of GC pauses.
> >>
> >> If there is a problem in software using the client and whatever logging
> >> is available doesn't help you figure out what's wrong, you're generally
> >> going to need to talk to whoever wrote that software for help
> >> troubleshooting it.
> >>
> >> Thanks,
> >> Shawn
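Shawn's suggestion to enable GC logging is the first thing to try; on HotSpot JDK 8 that is typically `-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file>`, and on JDK 9 and later `-Xlog:gc*:file=<file>`. As an in-process complement, a small pause detector can report when the client JVM stops running for longer than expected. The sketch below is hypothetical (the `PauseDetector` class is not part of ZooKeeper or the JDK) and only measures that the JVM stalled, not why; a stall can also come from swapping or CPU starvation, not just GC.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical JVM "pause detector" (the idea behind tools such as jHiccup):
// a daemon thread asks to sleep for a fixed interval and measures, on the
// monotonic clock, how much longer than requested it actually took.
final class PauseDetector implements Runnable {
    private final long intervalMillis;
    private final long reportThresholdMillis;
    private volatile long maxPauseMillis;

    PauseDetector(long intervalMillis, long reportThresholdMillis) {
        this.intervalMillis = intervalMillis;
        this.reportThresholdMillis = reportThresholdMillis;
    }

    // Extra time beyond the requested sleep, in whole milliseconds (never negative).
    static long excessMillis(long requestedMillis, long startNanos, long endNanos) {
        long actualMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, actualMillis - requestedMillis);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();   // preserve status and stop
                return;
            }
            long excess = excessMillis(intervalMillis, start, System.nanoTime());
            if (excess > maxPauseMillis) {
                maxPauseMillis = excess;
            }
            if (excess >= reportThresholdMillis) {
                // A stall this long often means the whole JVM was paused (for example
                // a stop-the-world GC), which would also delay the ZK client's heartbeats.
                System.err.println("JVM stalled for ~" + excess + " ms");
            }
        }
    }

    long maxPauseMillis() {
        return maxPauseMillis;
    }

    // Starts a detector on a daemon thread so it never blocks JVM shutdown.
    static PauseDetector startDaemon(long intervalMillis, long reportThresholdMillis) {
        PauseDetector detector = new PauseDetector(intervalMillis, reportThresholdMillis);
        Thread thread = new Thread(detector, "pause-detector");
        thread.setDaemon(true);
        thread.start();
        return detector;
    }
}
```

For example, calling `PauseDetector.startDaemon(100, 1000)` early in the client's `main` would log any stall of a second or more; comparing those log lines with the client's session-expired events can show whether pauses line up with expirations.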