zookeeper-user mailing list archives

From Prabhjot Bharaj <prabhbha...@gmail.com>
Subject Re: Zookeeper JMX monitoring - important parameters
Date Mon, 23 Nov 2015 17:03:22 GMT
Hello Folks,

Could you please share your experiences on this?

Thanks,
Prabhjot
On Nov 23, 2015 3:53 PM, "Prabhjot Bharaj" <prabhbharaj@gmail.com> wrote:

> Hello Folks,
>
> I would like to know which important ZooKeeper parameters can be
> monitored on a ZooKeeper server via its JMX port. I've set up my 5-node
> ZooKeeper ensemble following the steps on this page:
> https://zookeeper.apache.org/doc/r3.4.6/zookeeperJMX.html#ch_console
>
> After connecting to the JVM via jconsole, I can see the stats. However, I
> would like to know which stats/values we should send to our reporting
> system so that we are alerted when a vital parameter shows an unexpected
> value.
> --------------------------------------------
> Here is the homework I've done on it:
>
> 1. QuorumSize (under ReplicatedServer_id<#myid value>) - Must always be
> equal to the number of nodes in zookeeper.conf.
>
>    a. Example MBean -
>       org.apache.ZooKeeperService:name0=ReplicatedServer_id7
>
>    b. Alert - Fire an alert if this value drops below (floor(n/2) + 1),
>       where n is the total number of machines participating in the
>       ensemble. Below that threshold the cluster's health is bad.
>
>    c. Procedure - Bounce the servers that are not participating in the
>       quorum and see whether this attribute changes.
>
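The threshold in the QuorumSize alert above can be sketched as a small check. This is only an illustration of the floor(n/2) + 1 arithmetic; the function names are made up for this example, and it assumes the reporting system already has the attribute value as an integer.

```python
def majority_threshold(ensemble_size: int) -> int:
    """Smallest number of servers that still forms a majority: floor(n/2) + 1."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def quorum_alert(reported_value: int, ensemble_size: int) -> bool:
    """Return True when the reported value is below the majority threshold."""
    return reported_value < majority_threshold(ensemble_size)


if __name__ == "__main__":
    # For the 5-node ensemble described above, the threshold is 3.
    print(majority_threshold(5))
    print(quorum_alert(2, 5))
```

For a 5-node ensemble the threshold works out to 3, so a reported value of 2 would raise the alert and a value of 3 would not.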
> 2. NodeCount (under InMemoryDataTree) - Should be equal on all the
> machines in a cluster. This helps us check the consistency of the data
> across the ZooKeeper cluster.
>
>    a. Example MBean -
>       org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader,name3=InMemoryDataTree
>
>    b. Alert - If any server in the cluster shows a NodeCount different
>       from that of the other servers in the ensemble, fire an alert.
>
>    c. Procedure - There is no generalised solution for this. It will need
>       investigation.
>
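A rough sketch of the NodeCount comparison, assuming the values have already been collected from each server into a dictionary. The server labels (zk1, zk2, ...) and the function name are hypothetical. The "expected" value is simply taken to be the most common one.

```python
from collections import Counter
from typing import Dict, List


def node_count_outliers(node_counts: Dict[str, int]) -> List[str]:
    """Return the servers whose NodeCount differs from the most common value."""
    if not node_counts:
        return []
    # Treat the value reported by the largest group of servers as the reference.
    reference, _ = Counter(node_counts.values()).most_common(1)[0]
    return sorted(name for name, count in node_counts.items() if count != reference)


if __name__ == "__main__":
    sample = {"zk1": 120, "zk2": 120, "zk3": 118, "zk4": 120, "zk5": 120}
    print(node_count_outliers(sample))
```

Since a follower can briefly lag behind the leader, it is probably wise to alert only when a mismatch persists across several polls rather than on a single sample.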
> 3. Memory Management -
>    a. GarbageCollection - Important parameters for monitoring garbage
>       collection on the ZooKeeper server nodes. Any value in this section
>       that is significantly higher than on the other nodes in the ensemble
>       can point to a problem in the cluster.
>       i. ConcurrentMarkSweep time, to be monitored across all nodes
>          Example MBean - java.lang:type=GarbageCollector,name=ConcurrentMarkSweep
>       ii. ParNew time, to be monitored across all nodes
>          Example MBean - java.lang:type=GarbageCollector,name=ParNew
>
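One possible way to make "significantly higher than the other nodes" concrete is to compare each server's GC time against the median of the ensemble. This is a sketch only: the factor of 2.0, the server labels and the function name are assumptions, not something taken from ZooKeeper or the JVM.

```python
from statistics import median
from typing import Dict, List


def gc_time_outliers(gc_time_ms: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return servers whose GC time exceeds `factor` times the ensemble median."""
    if len(gc_time_ms) < 2:
        # With fewer than two servers there is nothing to compare against.
        return []
    baseline = median(gc_time_ms.values())
    return sorted(name for name, t in gc_time_ms.items() if t > factor * baseline)


if __name__ == "__main__":
    sample = {"zk1": 100.0, "zk2": 110.0, "zk3": 105.0, "zk4": 400.0, "zk5": 95.0}
    print(gc_time_outliers(sample))
```

The CollectionTime attribute of a GarbageCollector MBean is cumulative since JVM start, so feeding in the increase per polling interval is likely to compare better across servers with different uptimes.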
> 4. Leader count - This must be 1 at all times. Out of all the
> replica.<#myid values> under ReplicatedServer_id<#myid value> on all
> machines, there should be only 1 leader.
>
>    a. Example MBean -
>       org.apache.ZooKeeperService:name0=ReplicatedServer_id7,name1=replica.7,name2=Leader
>
>    b. Alert - Across all the nodes reporting data in the cluster, only one
>       should expose name<x>=Leader. Set up an alert on this. If the alert
>       fires, it means ZooKeeper went through a split brain. This is a
>       high-risk situation.
>
>    c. Procedure - Check that the network is healthy amongst the machines.
>       If there is network slowness amongst nodes in a rack, or across
>       racks (in case ZooKeeper nodes are placed across racks), it must be
>       taken care of. Until it is solved, find a machine with good network
>       connectivity, push a config that adds this new machine to the
>       cluster, and remove the affected machine from the cluster.
>
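The leader-count check can be sketched by pulling the name2 key out of each server's MBean name and counting how many report Leader. The helper names and server labels are hypothetical, and the sketch assumes one MBean name string per server, in the format shown in the example above.

```python
from typing import Dict, Optional


def role_from_object_name(object_name: str) -> Optional[str]:
    """Extract the value of the name2 key (e.g. "Leader") from an MBean name."""
    _, _, properties = object_name.partition(":")
    for pair in properties.split(","):
        key, _, value = pair.partition("=")
        if key == "name2":
            return value
    return None


def leader_count(object_names: Dict[str, str]) -> int:
    """Count the servers whose MBean name carries name2=Leader."""
    return sum(1 for n in object_names.values() if role_from_object_name(n) == "Leader")


def leader_alert(object_names: Dict[str, str]) -> bool:
    """Return True when the number of leaders is anything other than 1."""
    return leader_count(object_names) != 1


if __name__ == "__main__":
    prefix = "org.apache.ZooKeeperService:name0=ReplicatedServer_id"
    sample = {
        "zk7": prefix + "7,name1=replica.7,name2=Leader",
        "zk8": prefix + "8,name1=replica.8,name2=Follower",
    }
    print(leader_alert(sample))
```

A count of 0 also raises the alert here; that can happen briefly during a leader election, so requiring the condition to persist before paging someone is probably a good idea.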
>
>
> --------------------------------------------
>
>
> I would like to know whether the above parameters are sufficient for
> monitoring the cluster, or whether I have missed something. I would
> appreciate your help in pointing me in the right direction. Please feel
> free to point out any changes needed in the above write-up.
>
>
> Thanks,
>
> Prabhjot
>
>
>
