Subject: Re: Zookeeper session expiration
To: user@zookeeper.apache.org
From: Shawn Heisey
Date: Mon, 4 Dec 2017 12:49:26 -0700

On 12/4/2017 8:22 AM, Anthony Shaya wrote:
> My question is related to how session expiration works, I noticed on many of the client
> machines the times across these machines were all off (by anywhere from 1 minute to 20 minutes - which was resolved after discovery - haven't verified this completely yet). Can this directly affect session expiration within the zookeeper cluster?
>
> * I read the following in https://wiki.apache.org/hadoop/ZooKeeper/FAQ , "Expirations happens when the cluster does not hear from the client within the specified session timeout period (i.e. no heartbeat).". So in some case it seems like if the times were wrong across the machines its possible one of the clients could of effectively sent a heart beat in the past (not sure about this tbh) and then the cluster expires the session?

I make these comments without any knowledge of what the ZK code actually does. I am a member of this list because I'm a representative of the Apache Solr project, which uses the ZK client in order to maintain a cluster.

IMHO, any software which makes actual decisions based on the timestamps in messages from another system is badly designed. I would hope that the ZK designers know this, and always make any decisions related to time using the clock in the local system only.

If ZK's designers did the right thing, then a session timeout would indicate that quite literally no heartbeats were received in X seconds, as measured by the local clock, and the local clock ONLY ... NOT from timestamp information received from another system.

Although such a lack of communication could be caused by any number of things, including network hardware failure, one of the most common reasons I have seen for problems like this is extreme Java garbage collection pauses in the client software. Situations where the heap is a little bit too small can cause a Java program to basically be doing garbage collection constantly, so it doesn't have much time to do anything else, like send heartbeats to ZK servers.
Situations where the heap is HUGE and garbage collection is not well tuned can lead to pauses of a minute or longer while Java does a massive full GC.

> * I don't have the zookeeper node log for the above time to see what was going on in zookeeper when the cluster determined the session expired.
>
> * Is there any additional logging I can turn on to troubleshoot zk session expiration issues?

Hopefully your ZK clients also have logging. Failing that, you could turn on GC logging for the software with the ZK client (assuming it's a Java client) and find a program or website that can examine the log and give you statistics or a graph of GC pauses.

If there is a problem in software using the client and whatever logging is available doesn't help you figure out what's wrong, you're generally going to need to talk to whoever wrote that software for help troubleshooting it.

Thanks,
Shawn
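
To illustrate the local-clock point made earlier in this message, here is a minimal sketch of a session tracker that expires sessions based only on when heartbeats arrive, as measured by its own monotonic clock. This is NOT ZooKeeper's code and I am not claiming ZK works this way; the names `LocalClockSessionTracker`, `FakeClock`, and `remote_timestamp` are all hypothetical, made up for the example. The client whose wall clock is 20 minutes off stays alive because it keeps sending heartbeats, while the client with a correct clock expires because it goes silent.

```python
import time


class LocalClockSessionTracker:
    """Hypothetical tracker: expiry uses only the tracker's own clock."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock      # local clock; never a remote timestamp
        self.last_heard = {}    # session id -> local time of last heartbeat

    def heartbeat(self, session_id, remote_timestamp=None):
        # remote_timestamp is accepted only to show that it is ignored:
        # the heartbeat is stamped with the local arrival time.
        self.last_heard[session_id] = self.clock()

    def expired_sessions(self):
        now = self.clock()
        return sorted(
            sid for sid, seen in self.last_heard.items()
            if now - seen > self.timeout_seconds
        )


class FakeClock:
    """Hand-advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
tracker = LocalClockSessionTracker(timeout_seconds=30, clock=clock)

# client-a's wall clock is 20 minutes slow; client-b's is correct.
tracker.heartbeat("client-a", remote_timestamp=-1200.0)
tracker.heartbeat("client-b", remote_timestamp=0.0)

clock.advance(20)
tracker.heartbeat("client-a", remote_timestamp=-1180.0)  # still heartbeating
clock.advance(20)  # client-b has now been silent for 40 local seconds

print(tracker.expired_sessions())  # prints ['client-b']
```

With this kind of design, clock skew on the client cannot cause an expiration by itself. Only silence can, which is why a long GC pause in the client is a more likely suspect than the wrong time of day.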