zookeeper-user mailing list archives

From Jamie Rothfeder <jamie.rothfe...@gmail.com>
Subject Re: Zookeeper session losing some watchers
Date Tue, 08 Nov 2011 05:45:52 GMT
Hi Jun,

It depends. The watch might just be reregistered on another node (specifically, the original
path minus the chroot). This case is really easy to test, even on a single, locally running
instance: create a watch, then print out the registered watches using the wchc or wchp four
letter words. Restart the ZooKeeper server. After the client automatically reconnects, rerun
the four letter word to observe what happened to the watch.
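A minimal sketch of that check, assuming a server on localhost:2181 (the `ask_zk` helper name and its fallback message are my own, not anything ZooKeeper ships):

```shell
# Query a ZooKeeper server with a four letter word over netcat.
# Prints a fallback message instead of failing when the server is down.
ask_zk() {
  # $1: four letter word (e.g. wchc, wchp)  $2: host  $3: port
  echo "$1" | nc "$2" "$3" 2>/dev/null || echo "server unreachable"
}

# wchc: watches grouped by session; wchp: watches grouped by path.
ask_zk wchc localhost 2181
ask_zk wchp localhost 2181

# If the client is chrooted at /kafka and ZOOKEEPER-961 strikes on
# reconnect, a watch set on /kafka/brokers can reappear in wchp output
# as /brokers instead.
```

Running the wchp query before and after the server restart makes the missing (or wrongly-pathed) watch visible immediately.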


On Nov 7, 2011, at 7:27 PM, Jun Rao <junrao@gmail.com> wrote:

> Jamie,
> We do use a chroot. However, the chroot bug would lose all watchers, not
> just some, right?
> Thanks,
> Jun
> On Wed, Nov 2, 2011 at 7:34 PM, Jamie Rothfeder
> <jamie.rothfeder@gmail.com>wrote:
>> Hi Neha,
>> I encountered a similar problem with zookeeper losing watches and found
>> that it was related to this bug:
>> https://issues.apache.org/jira/browse/ZOOKEEPER-961
>> Are you using a chroot?
>> Thanks,
>> Jamie
>> On Wed, Nov 2, 2011 at 1:16 PM, Neha Narkhede <neha. @gmail.com> wrote:
>>> Hi,
>>> We've been seeing a problem with our zookeeper servers lately, where
>>> all of a sudden a session loses some of the watchers registered on
>>> some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka
>>> cluster in one DC establishing sessions (with a 6 sec timeout) with a ZK
>>> cluster (of 4 machines) in another DC and registering watchers on some
>>> zookeeper paths. Every couple of weeks, we observe some problem with
>>> the Kafka servers, and on investigating further, we find that the
>>> session lost some of the key watches, but not all.
>>> The last time this happened, we ran the wchc command on the ZK servers
>>> and saw the problem. Unfortunately, we lost relevant information from
>>> the ZK logs by the time we were ready to debug it further. Since this
>>> causes Kafka servers to stop making progress, we want to set up some
>>> kind of alert when this happens. This will help us collect more
>>> information to give you. In particular, we were thinking about running
>>> wchp periodically (maybe once a minute), grepping for the ZK paths, and
>>> counting the number of watches that should be registered for correct
>>> operation. But I observed that the watcher info is not replicated
>>> across all ZK servers, so we would have to query every ZK server in
>>> order to get the full list.
>>> I'm not sure running wchp periodically on all ZK servers is the best
>>> option for this alert. Can you think of what the problem could be here
>>> and how we can set up this alert for now?
>>> Thanks
>>> Neha
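The periodic wchp check Neha describes could be sketched roughly as below. The hostnames, port, and watched path are illustrative, and since watch info is per-server, each ensemble member must be queried separately; a real alert would compare the total against the expected watch count and page when it drops:

```shell
#!/bin/sh
# Count watches on a given path across every server in the ensemble.
WATCH_PATH="/brokers"          # illustrative ZK path
SERVERS="zk1 zk2 zk3 zk4"      # illustrative hostnames
PORT=2181

total=0
for host in $SERVERS; do
  # wchp lists each watched path followed by the sessions watching it;
  # grep -c counts lines starting with the path of interest.
  n=$(echo wchp | nc "$host" "$PORT" 2>/dev/null | grep -c "^$WATCH_PATH") || n=0
  echo "$host: $n watch(es) on $WATCH_PATH"
  total=$((total + n))
done
echo "total: $total"
```

Note that wchp can be expensive on servers with many watches, so running it once a minute, as proposed, is about as frequent as one would want in production.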
