zookeeper-user mailing list archives

From Prabhjot Bharaj <prabhbha...@gmail.com>
Subject Zookeeper JMX monitoring - important parameters
Date Mon, 23 Nov 2015 10:23:06 GMT
Hello Folks,

I would like to know which zookeeper parameters are important to monitor
on a zookeeper server via its JMX port. I've set up my 5-node
zookeeper ensemble following the steps on this page:

After connecting to the JVM via jconsole, I can see the stats. However, I would
like to know which stats/values we should send to our reporting system so that
we are alerted if some vital parameter shows an unexpected value.
Here is the homework I've done on it:

1. QuorumSize (under ReplicatedServer_id<#myid value>) - Must always be
equal to the number of nodes in zookeeper.conf.


      Example MBean - org.apache.ZooKeeperService:name0=ReplicatedServer_id7

      Alert - It should never be lower than (floor(n/2) + 1). If this
      happens, the cluster’s health is bad. Alert on this value going
      lower than (floor(n/2) + 1), where n is the total machines
      participating in the

c. Procedure - bounce the servers which are not participating in the quorum
and see if it changes anything on this attribute
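
The threshold in point 1 can be sketched as a small check. This is only a minimal sketch in Python, assuming the QuorumSize value has already been read from JMX by whatever collector feeds the reporting system. The function names quorum_threshold and quorum_size_alert are hypothetical and not part of zookeeper.

```python
def quorum_threshold(ensemble_size):
    """Smallest majority of an ensemble: floor(n/2) + 1."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def quorum_size_alert(quorum_size, ensemble_size):
    """Return True when the reported value is below the majority threshold."""
    return quorum_size < quorum_threshold(ensemble_size)


# For the 5-node ensemble mentioned above the threshold is 3.
print(quorum_threshold(5))       # 3
print(quorum_size_alert(2, 5))   # True, so an alert would fire
print(quorum_size_alert(5, 5))   # False
```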

2. NodeCount (under InMemoryDataTree) - should be equal across all the
machines in a cluster. This helps us check consistency of nodes in the zookeeper


      Example MBean -

      Alert - if any server in the cluster reports a NodeCount that differs
      from the value reported by the other servers in the ensemble, fire an alert

c. Procedure - There is no generalised solution for this. This will need
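
The NodeCount comparison across servers in point 2 could look roughly like the sketch below. It assumes the NodeCount of every server has already been collected into a dict. The server names and the function name node_count_outliers are hypothetical.

```python
from collections import Counter


def node_count_outliers(node_counts):
    """Return the servers whose NodeCount differs from the most common value.

    node_counts maps a server name to the NodeCount that server reported.
    """
    if not node_counts:
        return {}
    most_common_value, _ = Counter(node_counts.values()).most_common(1)[0]
    return {server: count
            for server, count in node_counts.items()
            if count != most_common_value}


# Hypothetical sample collected from three servers.
sample = {"zk1": 120, "zk2": 120, "zk3": 118}
print(node_count_outliers(sample))   # {'zk3': 118}
```

A follower that is briefly catching up can report a different count for a short time, so it may be worth alerting only when the mismatch persists over several consecutive samples.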

3. Memory Management -
a. GarbageCollection - Listing important parameters for monitoring garbage
collection on the zookeeper server nodes. Any value in this section that
is significantly higher than on the other nodes in the ensemble can point
to something fishy in the cluster.
i. ConcurrentMarkSweep time to be monitored across all nodes
Example MBean - java.lang:type=GarbageCollector,name=ConcurrentMarkSweep
ii. ParNew time to be monitored across all nodes
Example MBean - java.lang:type=GarbageCollector,name=ParNew
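
One way to express "significantly higher than the other nodes" is to compare each node against the median of the rest. The sketch below assumes the values are the CollectionTime attribute of the GarbageCollector MBeans (cumulative milliseconds), already collected per server. The factor of 2.0, the server names and the function name gc_time_outliers are hypothetical choices, not recommendations.

```python
from statistics import median


def gc_time_outliers(gc_times_ms, factor=2.0):
    """Return servers whose GC time exceeds factor times the median of the others.

    gc_times_ms maps a server name to its GC collection time in milliseconds.
    Servers are skipped when there is nothing to compare against or when the
    baseline of the other servers is zero.
    """
    outliers = {}
    for server, value in gc_times_ms.items():
        others = [v for s, v in gc_times_ms.items() if s != server]
        if not others:
            continue
        baseline = median(others)
        if baseline > 0 and value > factor * baseline:
            outliers[server] = value
    return outliers


# Hypothetical ConcurrentMarkSweep times from three servers.
sample = {"zk1": 400, "zk2": 420, "zk3": 1500}
print(gc_time_outliers(sample))   # {'zk3': 1500}
```

Since CollectionTime is cumulative since JVM start, servers with very different uptimes are probably better compared on the increase over the same interval than on the raw value.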

4. Leader count - this must be 1 at all times - out of all the
replica.<#myid values> under ReplicatedServer_id<#myid value> on all
machines, there should be only 1 leader.
a. Example MBean -



   Alert - name<x>=Leader should appear exactly once across all the nodes
   reporting data in the cluster - set up an alert on this. If the alert is
   fired, it means zookeeper may have gone through a split brain. This is a
   high-risk situation.

   Procedure - check that the network is healthy amongst the machines. If there
   is some n/w slowness amongst nodes in a rack, or across racks (in case
   zookeeper nodes are placed across racks), it must be taken care of. Until it
   is solved, find a good machine which has good n/w connectivity, then push a
   config that adds this new machine to the cluster and removes the affected
   existing machine from the cluster.
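
The leader-count check in point 4 can be sketched as below, assuming the role of each server has already been derived from its MBeans and collected into a dict. The role strings, the server names and the function name leader_count_alert are hypothetical.

```python
def leader_count_alert(roles):
    """Return (alert, leaders) for a mapping of server name to reported role.

    alert is True when the number of servers reporting the leader role is
    anything other than exactly one.
    """
    leaders = [server for server, role in roles.items()
               if role.lower() == "leader"]
    return len(leaders) != 1, leaders


# Hypothetical roles reported by three servers.
healthy = {"zk1": "Follower", "zk2": "Leader", "zk3": "Follower"}
print(leader_count_alert(healthy))   # (False, ['zk2'])

two_leaders = {"zk1": "Leader", "zk2": "Leader", "zk3": "Follower"}
print(leader_count_alert(two_leaders))   # (True, ['zk1', 'zk2'])
```

Returning the list of leaders alongside the flag lets the alert message say which machines claim leadership, which helps when checking the network between them.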


I would like to know if the above parameters for monitoring the cluster are
sufficient, or whether I have missed something. I would appreciate your help in
pointing me in the right direction. Please feel free to point out any changes
needed in the above write-up.


