Date: Mon, 23 Nov 2015 15:53:06 +0530
Subject: Zookeeper JMX monitoring - important parameters
From: Prabhjot Bharaj
To: user@zookeeper.apache.org

Hello Folks,

I would like to know which ZooKeeper parameters are important to monitor on a ZooKeeper server via its JMX port.

I've set up my 5-node ZooKeeper ensemble following the steps on this page:
https://zookeeper.apache.org/doc/r3.4.6/zookeeperJMX.html#ch_console

After connecting to the JVM via jconsole, I can see the stats. But I would like to know which stats/values we should send to our reporting system so that we are alerted when a vital parameter shows an unexpected value.

--------------------------------------------

Here is the homework I've done on it:

1. QuorumSize (under ReplicatedServer_id<#myid value>) - must always be equal to the number of nodes in zookeeper.conf.
   a. Example MBean - org.apache.ZooKeeperService:name0=ReplicatedServer_id7
   b. Alert - fire an alert if this value goes lower than (floor(n/2) + 1), where n is the total number of machines participating in the ensemble. If this happens, the cluster's health is bad.
   c. Procedure - bounce the servers that are not participating in the quorum and see whether this attribute changes.

2. NodeCount (under InMemoryDataTree) - should be equal on all the machines in a cluster. This helps us check the consistency of nodes in the ZooKeeper cluster.
   a. Example MBean - org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader,name3=InMemoryDataTree
   b. Alert - if any server in the cluster reports a NodeCount different from that of the other servers in the ensemble, fire an alert.
   c. Procedure - there is no generalised solution for this. It will need investigation.

3. Memory Management
   a. GarbageCollection - important parameters for monitoring garbage collection on the ZooKeeper server nodes. Any value in this section that is significantly higher than on the other nodes in the ensemble can point to something fishy in the cluster.
      i. ConcurrentMarkSweep time, to be monitored across all nodes
         Example MBean - java.lang:type=GarbageCollector,name=ConcurrentMarkSweep
      ii. ParNew time, to be monitored across all nodes
         Example MBean - java.lang:type=GarbageCollector,name=ParNew

4. Leader count - this must be 1 at all times. Out of all the replica.<#myid value> entries under ReplicatedServer_id<#myid value> on all machines, there should be only 1 leader.
   a. Example MBean - org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader
   b. Alert - name2=Leader should appear exactly once across all the nodes reporting data in the cluster; set up an alert on this. If the alert fires, it means ZooKeeper went through a split brain. This is a high-risk situation.
   c. Procedure - check that the network is healthy among the machines. If there is network slowness among nodes in a rack, or across racks (in case ZooKeeper nodes are placed across racks), it must be taken care of. Until it is solved, find a good machine with good network connectivity, push a config that adds this new machine to the cluster, and remove the existing machine from the cluster.

--------------------------------------------

I would like to know whether the above parameters are sufficient for monitoring the cluster, or whether I missed something. Please help point me in the right direction, and feel free to point out any changes needed in the above write-up.

Thanks,
Prabhjot
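
To make the alert rules above concrete, here is a minimal Python sketch. It assumes the JMX values have already been collected into one dict per server; it does not talk to JMX itself. The function and key names (min_quorum, check_ensemble, quorum_size, node_count, is_leader) are made up for illustration and are not ZooKeeper or JMX names.

```python
from typing import Dict, List


def min_quorum(ensemble_size: int) -> int:
    """Smallest majority for an ensemble of n servers: floor(n/2) + 1."""
    return ensemble_size // 2 + 1


def check_ensemble(snapshots: Dict[str, dict], ensemble_size: int) -> List[str]:
    """Return alert messages for one polling round.

    `snapshots` maps a server name to values already read over JMX, e.g.
    {"quorum_size": 5, "node_count": 120, "is_leader": False}.
    These key names are hypothetical, chosen only for this sketch.
    """
    alerts = []
    threshold = min_quorum(ensemble_size)

    # Rule 1: QuorumSize must not drop below floor(n/2) + 1.
    for server, snap in sorted(snapshots.items()):
        if snap["quorum_size"] < threshold:
            alerts.append(f"{server}: QuorumSize {snap['quorum_size']} < {threshold}")

    # Rule 2: NodeCount should be the same on every server.
    node_counts = {snap["node_count"] for snap in snapshots.values()}
    if len(node_counts) > 1:
        alerts.append(f"NodeCount differs across servers: {sorted(node_counts)}")

    # Rule 4: exactly one server should report itself as leader.
    leaders = sorted(s for s, snap in snapshots.items() if snap["is_leader"])
    if len(leaders) != 1:
        alerts.append(f"expected exactly 1 leader, found {len(leaders)}: {leaders}")

    return alerts
```

For a 5-node ensemble, min_quorum(5) gives 3, so the first rule fires only when a server reports a QuorumSize of 2 or less. A real check would probably need to tolerate brief differences, for example by alerting only when a rule fails on several consecutive polls.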