MIME-Version: 1.0
Date: Wed, 14 Apr 2010 19:43:48 -0700
Subject: Re: monitoring zookeeper
From: Travis Crawford
To: zookeeper-user@hadoop.apache.org
Reply-To: zookeeper-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=0016e6465434da86e504843d76bc

--0016e6465434da86e504843d76bc
Content-Type: text/plain; charset=ISO-8859-1

Hey Kishore -

Thanks for the info. I found an interesting library called jmxetric
(http://code.google.com/p/jmxetric) that reads MBeans and publishes their
contents to Ganglia, and it's working pretty well. A simplified config looks
like:

It doesn't solve the nested property issue, unfortunately, so I may have to
flatten some statistics as you have. I'm interested in checking out your code
if you don't mind.

At a higher level, I'm interested in setting up the sort of monitoring one
would expect of a critical datacenter service. To start with, I'd like to
collect the data necessary to:

- page when there's no leader
- page when only the minimum number of replicas needed to reach quorum are
  present
- email when replicas are missing, but the ensemble is still above the quorum
  minimum.

For example, send an email when 1/5 are down, and page when 2/5 are down.
Also page if there's no leader for some other reason.

The operational metrics like latencies, connections, and requests would be
useful in troubleshooting issues as well as capacity planning.

--travis

On Wed, Apr 14, 2010 at 4:50 PM, kishore g wrote:

> Hi Travis,
>
> We do monitor zookeeper using JMX.
> We have a simple code which does the following
>
> - parse JMX output and convert the output into key value format. The
>   nested properties are flattened.
> - Emit the key values using LWES[ http://www.lwes.org/] Api's at regular
>   interval[configurable]
> - The keys to be emitted can be configured via config file.
>
> We have our own internal reporting framework which displays these metrics.
> In order to differentiate between leader and follower we use separate keys
> to
>
> ReplicatedServer_idXXX_replica.XXX_Follower.AvgRequestLatency=rsf_mrl
> ReplicatedServer_idXXX_replica.XXX_Leader.AvgRequestLatency=rsl_mrl
>
> If the server is leader then rsf_mrl will be empty and vice versa. I can
> provide the code to do this and you can probably change it to meet your
> needs and enhance it to work for Ganglia. Let me know if this helps you.
>
> thanks,
> Kishore G
>
> On Wed, Apr 14, 2010 at 11:12 AM, Travis Crawford wrote:
>
> > Hey zookeeper gurus -
> >
> > Are there any recommended ways for one to monitor zookeeper ensembles?
> > I'm familiar with the four-letter words and that stats are published
> > via JMX - I'm more interested in what people are doing with those stats.
> >
> > I'd like to publish the JMX stats to Ganglia, and this works well for
> > the built-in stats. However, the zookeeper-specific names appear to be
> > dynamic which causes issues when deciding what to publish. For example,
> > the current mode (leader/follower) appears to only be accessible from
> > the bean names, instead of looking at, say, a "mode" stat.
> >
> > org.apache.ZooKeeperService:name0=ReplicatedServer_id1,name1=replica.1,name2=Follower
> > org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader
> >
> > The only way I've found to learn if replicas are up-to-date is looking
> > at "synced" buried in followerInfo:
> >
> > $ java -jar cmdline-jmxclient-0.10.5.jar - localhost:8081 \
> >     org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader \
> >     followerInfo
> > 04/14/2010 18:06:06 +0000 org.archive.jmx.Client followerInfo:
> > FollowerHandler Socket[addr=/10.0.0.10,port=48104,localport=2888]
> > tickOfLastAck:29793 synced?:true queuedPacketLength:0
> > FollowerHandler Socket[addr=/10.0.0.11,port=59599,localport=2888]
> > tickOfLastAck:29793 synced?:true queuedPacketLength:0
> >
> > I don't mind writing a tool to parse the JMX output and publishing to
> > Ganglia if needed, but it seems like a problem that may have already
> > been solved and I'm curious what others are doing. The tool would
> > basically take the zookeeper stats, normalize the names, and publish to
> > a timeseries database.
> >
> > Is anyone already monitoring ZK in a way others might find useful?
> >
> > Thanks!
> > Travis

--0016e6465434da86e504843d76bc--
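
The alerting rules and JMX fields discussed in this thread can be sketched
in Python roughly as follows. This is only an illustration, not code from
either poster. The function names (mode_from_bean_name, synced_followers,
alert_level) and the regular expressions are assumptions, written against
the bean names and the followerInfo sample quoted in the thread; the output
format may well differ in other ZooKeeper versions.

```python
import re

# Assumed pattern: the mode is the last component of bean names such as
# org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=replica.2,name2=Leader
MODE_RE = re.compile(r"name2=(Leader|Follower)\b")

# Assumed pattern for one follower entry in the Leader bean's followerInfo
# attribute, based only on the sample output quoted in the thread.
FOLLOWER_RE = re.compile(
    r"addr=/(?P<addr>[\d.]+),port=\d+,localport=\d+\]\s+"
    r"tickOfLastAck:(?P<tick>\d+)\s+synced\?:(?P<synced>true|false)"
)


def mode_from_bean_name(bean_name):
    """Return 'leader' or 'follower', or None if the name has no mode part."""
    match = MODE_RE.search(bean_name)
    return match.group(1).lower() if match else None


def synced_followers(follower_info):
    """Return the addresses of followers that report synced?:true."""
    return [
        m.group("addr")
        for m in FOLLOWER_RE.finditer(follower_info)
        if m.group("synced") == "true"
    ]


def alert_level(ensemble_size, replicas_up, has_leader):
    """Map ensemble health to 'page', 'email', or 'ok'.

    Pages when there is no leader, when quorum is lost, or when only the
    bare quorum minimum is up. Emails when some replicas are missing but
    more than the quorum minimum are still up.
    """
    quorum = ensemble_size // 2 + 1
    if not has_leader or replicas_up < quorum:
        return "page"
    if replicas_up == quorum and replicas_up < ensemble_size:
        return "page"
    if replicas_up < ensemble_size:
        return "email"
    return "ok"
```

With a five-node ensemble, alert_level(5, 4, True) gives "email" and
alert_level(5, 3, True) gives "page", which matches the 1/5 and 2/5 example
in the message above.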