zookeeper-user mailing list archives

From Jamie Rothfeder <jamie.rothfe...@gmail.com>
Subject Re: Zookeeper session losing some watchers
Date Tue, 08 Nov 2011 05:45:52 GMT
Hi Jun,

It depends. The client might just re-register the watch on a different znode (specifically, the
original path minus the chroot). This case is really easy to test, even on a single, locally
running instance. Just create a watch, then print out the watches using wchc or wchp. Restart
the ZooKeeper server. After the client automatically reconnects, re-run the four-letter word to
observe what happened to the watch.
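
Something like this untested sketch is what I have in mind (the connect string, chroot,
and znode path are just placeholders):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ChrootWatchTest {

    // Send a four-letter word (wchc, wchp, ...) to a server and return the raw reply.
    static String fourLetterWord(String host, int port, String cmd) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.getOutputStream().write(cmd.getBytes());
            s.getOutputStream().flush();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            InputStream in = s.getInputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) > 0; ) {
                out.write(buf, 0, n);
            }
            return out.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);

        // Connect through a chroot -- the case ZOOKEEPER-961 is about.
        ZooKeeper zk = new ZooKeeper("localhost:2181/myapp", 6000, new Watcher() {
            public void process(WatchedEvent e) {
                if (e.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // Register a watch; on the server this lives on /myapp/some-node.
        zk.exists("/some-node", true);

        System.out.println("before restart:");
        System.out.println(fourLetterWord("localhost", 2181, "wchc"));

        // Now restart the server, wait for the client to reconnect, and run the
        // same four-letter word again to see which path the watch ended up on.
    }
}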

-Jamie

On Nov 7, 2011, at 7:27 PM, Jun Rao <junrao@gmail.com> wrote:

> Jamie,
> 
> We do use a chroot. However, the chroot bug would lose all watchers, not
> just some, right?
> 
> Thanks,
> 
> Jun
> 
> On Wed, Nov 2, 2011 at 7:34 PM, Jamie Rothfeder <jamie.rothfeder@gmail.com> wrote:
> 
>> Hi Neha,
>> 
>> I encountered a similar problem with zookeeper losing watches and found
>> that it was related to this bug:
>> 
>> https://issues.apache.org/jira/browse/ZOOKEEPER-961
>> 
>> Are you using a chroot?
>> 
>> Thanks,
>> Jamie
>>
>> On Wed, Nov 2, 2011 at 1:16 PM, Neha Narkhede <neha. @gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> We've been seeing a problem with our zookeeper servers lately, where
>>> all of a sudden a session loses some of the watchers registered on
>>> some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka
>>> cluster in one DC establishing sessions (with a 6 sec timeout) with a ZK
>>> cluster (of 4 machines) in another DC and registering watchers on some
>>> zookeeper paths. Every couple of weeks, we observe some problem with
>>> the Kafka servers, and on investigating further, we find that the
>>> session has lost some of the key watches, but not all.
>>> 
>>> The last time this happened, we ran the wchc command on the ZK servers
>>> and saw the problem. Unfortunately, we had lost the relevant information
>>> from the ZK logs by the time we were ready to debug it further. Since this
>>> causes Kafka servers to stop making progress, we want to set up some
>>> kind of alert for when this happens. This will help us collect more
>>> information to give you. In particular, we were thinking about running
>>> wchp periodically (maybe once a minute), grepping for the ZK paths and
>>> counting the number of watches that should be registered for correct
>>> operation. But I observed that the watcher info is not replicated
>>> across all ZK servers, so we would have to query every ZK server in
>>> order to get the full list.
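>>>
>>> Roughly, the check we had in mind is something like the sketch below (untested;
>>> the hostnames, path prefix, and expected count are placeholders, and the wchp
>>> parsing assumes paths are printed unindented with the watching session ids
>>> tab-indented under them):
>>>
>>> import java.io.ByteArrayOutputStream;
>>> import java.io.IOException;
>>> import java.io.InputStream;
>>> import java.net.Socket;
>>>
>>> public class WatchCountCheck {
>>>
>>>     // Send a four-letter word to one server and return the raw reply.
>>>     static String fourLetterWord(String host, int port, String cmd) throws IOException {
>>>         try (Socket s = new Socket(host, port)) {
>>>             s.getOutputStream().write(cmd.getBytes());
>>>             ByteArrayOutputStream out = new ByteArrayOutputStream();
>>>             InputStream in = s.getInputStream();
>>>             byte[] buf = new byte[4096];
>>>             for (int n; (n = in.read(buf)) > 0; ) {
>>>                 out.write(buf, 0, n);
>>>             }
>>>             return out.toString();
>>>         }
>>>     }
>>>
>>>     public static void main(String[] args) throws Exception {
>>>         String[] servers = {"zk1", "zk2", "zk3", "zk4"};  // placeholder hostnames
>>>         String pathPrefix = "/brokers";                   // placeholder path we care about
>>>         int expected = 10;                                // placeholder expected watch count
>>>
>>>         int count = 0;
>>>         for (String host : servers) {
>>>             String reply = fourLetterWord(host, 2181, "wchp");
>>>             boolean matching = false;
>>>             for (String line : reply.split("\n")) {
>>>                 if (!line.startsWith("\t")) {
>>>                     // A znode path line; remember whether it is one we care about.
>>>                     matching = line.trim().startsWith(pathPrefix);
>>>                 } else if (matching) {
>>>                     // A session id watching a matching path.
>>>                     count++;
>>>                 }
>>>             }
>>>         }
>>>
>>>         if (count < expected) {
>>>             System.err.println("ALERT: found " + count + " watches under "
>>>                     + pathPrefix + ", expected at least " + expected);
>>>         }
>>>     }
>>> }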
>>> 
>>> I'm not sure running wchp periodically on all ZK servers is the best
>>> option for this alert. Can you think of what the problem could be here,
>>> and how we can set up this alert for now?
>>> 
>>> Thanks
>>> Neha
>>> 
>> 
