Date: Mon, 7 Nov 2011 16:27:40 -0800
Subject: Re: Zookeeper session losing some watchers
From: Jun Rao
To: user@zookeeper.apache.org

Jamie,

We do use chroot. However, the chroot problem would lose all watchers, not
just some of them, right?

Thanks,

Jun

On Wed, Nov 2, 2011 at 7:34 PM, Jamie Rothfeder wrote:

> Hi Neha,
>
> I encountered a similar problem with zookeeper losing watches and found
> that it was related to this bug:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-961
>
> Are you using a chroot?
>
> Thanks,
> Jamie
>
> On Wed, Nov 2, 2011 at 1:16 PM, Neha Narkhede wrote:
>
> > Hi,
> >
> > We've been seeing a problem with our zookeeper servers lately, where
> > all of a sudden a session loses some of the watchers registered on
> > some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka
> > cluster in one DC establishing sessions (with a 6 sec timeout) with a ZK
> > cluster (of 4 machines) in another DC and registering watchers on some
> > zookeeper paths. Every couple of weeks, we observe some problem with
> > the Kafka servers, where on investigating further, we find that the
> > session lost some of the key watches, but not all.
> >
> > The last time this happened, we ran the wchc command on the ZK servers
> > and saw the problem. Unfortunately, we lost relevant information from
> > the ZK logs by the time we were ready to debug it further. Since this
> > causes Kafka servers to stop making progress, we want to set up some
> > kind of alert when this happens. This will help us collect more
> > information to give you. In particular, we were thinking about running
> > wchp periodically (maybe once a minute), grepping for the ZK paths and
> > counting the number of watches that should be registered for correct
> > operation. But I observed that the watcher info is not replicated
> > across all ZK servers, so we would have to query every ZK server in
> > order to get the full list.
> >
> > I'm not sure running wchp periodically on all ZK servers is the best
> > option for this alert. Can you think of what could be the problem here
> > and how we can set up this alert for now?
> >
> > Thanks
> > Neha
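
The periodic check Neha describes (run wchp on every ZK server, merge the results, and count the watches on the paths that matter) could be sketched roughly as follows. This is only an illustration, not something taken from the thread: the function names are made up, and it assumes that wchp replies with each path on its own line followed by one tab-indented session id per line, which should be verified against the ZooKeeper version in use.

```python
import socket


def four_letter_word(host, port, cmd="wchp", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_wchp(text):
    """Map each path to the set of session ids watching it.

    Assumed layout: a path line, then one indented session id per line.
    """
    watches = {}
    path = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t":
            # Indented line: a session id belonging to the current path.
            if path is not None:
                watches[path].add(line.strip())
        else:
            path = line.strip()
            watches.setdefault(path, set())
    return watches


def merge_dumps(dumps):
    """Union per-server dumps, since each server only knows its own sessions."""
    merged = {}
    for dump in dumps:
        for path, sessions in dump.items():
            merged.setdefault(path, set()).update(sessions)
    return merged


def find_missing(expected, merged):
    """Return {path: (found, wanted)} for paths with too few watchers.

    `expected` maps a path to the minimum number of watching sessions.
    """
    missing = {}
    for path, wanted in expected.items():
        found = len(merged.get(path, ()))
        if found < wanted:
            missing[path] = (found, wanted)
    return missing


def check_cluster(servers, expected):
    """Query every (host, port) pair in `servers` and report missing watches."""
    dumps = [parse_wchp(four_letter_word(host, port)) for host, port in servers]
    return find_missing(expected, merge_dumps(dumps))
```

A cron job or monitoring agent could call check_cluster once a minute with the list of ZK servers and a map such as {"/brokers/ids": 3} (path and count purely illustrative), and raise the alert whenever the result is non-empty. If a chroot is in use, the paths reported by the server may include the chroot prefix, so the expected paths would need to be written accordingly.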