From: Flavio Junqueira
To: zookeeper-user@hadoop.apache.org
Subject: Re: Leader Elections
Date: Mon, 20 Jul 2009 20:44:05 -0700

For the partial subscription, my take right now is that it should be
part of the registration procedure. When an observer joins, it contacts
some ensemble server (a follower or the current leader) and appends a
path to the initial message it sends to this server. This path
corresponds to the subtree it is interested in. The server simply
filters the updates it sends to the observer based on the path prefix.

With respect to offline modes, we have considered having a replica
switch to read-only when it is not part of a quorum (ZOOKEEPER-40).
Dealing with reconciliation is messy, and I would say that we do not
want to go down this path at this moment. Others might have a different
opinion, though.

-Flavio

On Jul 20, 2009, at 2:46 PM, Henry Robinson wrote:

> I think partial subscription for an Observer would be easy to do - I
> don't think it will make it into 368, which is big enough already, but
> it would not be an enormous amount of work. The main thing to do is to
> figure out the protocol for subscription; probably just a new message
> type. That said, it would require some careful stepping around the
> sync code in order to make sure that the Observer knows what the
> latest zxid is even if it doesn't know the full history. Very do-able
> though.
>
> 'Offline' modes for parts of the graph are more challenging; we would
> need to think hard about the right way to implement this.
>
> I had imagined that Observers would be a good integration point for a
> third-party pubsub system (like TIB or something) via a plugin
> mechanism.
> In my opinion it's important for ZK not to try to become a general
> pubsub framework, which is not its core goal, although I can't speak
> for the committers. That said, rudimentary subscription is a good idea
> to prevent unnecessary WAN traffic.
>
> The idea of having ensembles subscribe to each other is a bit tricky;
> essentially it would require one ensemble to mirror the other, with
> one ensemble acting as the master and a netsplit putting the slave
> ensemble into read-only mode (or removing the mounted subtree). Again,
> I think it could be done but would be a big feature.
>
> Henry
>
> On Mon, Jul 20, 2009 at 10:21 PM, Scott Carey wrote:
>
>> Todd has put it much more eloquently. Comments below:
>>
>> On 7/20/09 11:50 AM, "Todd Greenwood" wrote:
>>
>>> Flavio, Ted, Henry, Scott, this would work perfectly well for my use
>>> case provided:
>>>
>>> SINGLE ENSEMBLE:
>>> GROUP A : ZK Servers w/ read/write AND Leader Elections
>>> GROUP B : ZK Servers w/ read/write W/O Leader Elections
>>>
>>> So, we can craft this via Observers and Hierarchical Quorum groups?
>>> Great. Problem solved.
>>>
>>> When will this be production ready? :o)
>>>
>>> --------------------
>>>
>>> Scott brought up a multi-feature that is very interesting for me.
>>> Namely:
>>>
>>> 1. Offline ZK servers that sync & merge on reconnect
>>>
>>> The offline servers idea seems conceptually simple; it's kind of like
>>> a messaging system. However, the merge and resolve step when two
>>> servers reconnect might be challenging. Cool idea though.
>>
>> Yes, this is very useful for WAN use cases. I've already done
>> something like it with a hack:
>> Ensemble A "Master/Central"
>> "Remote Proxy" N -- embeds its own ZK, and runs two clients. One
>> client connects to Ensemble A and watches a partial sub-graph,
>> propagating that into its local embedded ZK server. This subgraph is
>> read-only for those that access the Proxy.
>> A second client accesses the local ZK server and monitors a different
>> subgraph, which it propagates to the Master ensemble. This is
>> writeable by clients accessing the Proxy and on the Master ensemble
>> is only written to by this Proxy.
>>
>> The above is all application enforced. There are constraints on what
>> sort of things can be built with this, but for the subset of use
>> cases I need over WAN, it's more than enough.
>>
>>>
>>> 2. Partial memory graph subscriptions
>>>
>>> The second idea is partial memory graph subscriptions. This would
>>> enable virtual ensembles to interact on the same physical ensemble.
>>> For my use case, this would prevent unnecessary cross talk between
>>> nodes on a WAN, allowing me to define the subsets of the memory
>>> graph that need to be replicated, and to whom. This would be a huge
>>> scalability win for WAN use cases.
>>
>> Yes, a more general partial graph subscription / ownership framework
>> would allow for not just better WAN scalability but also (and more
>> critically IMO) higher reliability. Often, some large subset of
>> application functionality is local to one network, and a minority is
>> global and in need of WAN communication. In this case, when the WAN
>> breaks one wants the local functionality to continue to function,
>> and only those parts truly dependent on external events to be
>> interrupted.
>> Currently one has to have separate ensembles to partition data and
>> clunky 'bridge' code to intercommunicate.
>>
>> It would certainly be more natural if two ZK ensembles could register
>> with each other, in a 'partial sub-graph publish/subscribe'
>> framework. It could almost be like file system mounting:
>> To subscribe:
>> subscribe otherEnsemble:port/path/to/otherstuff /localpath/to/mount/into
>>
>> Publishing is the same thing -- think of it as a request for a remote
>> ZK cluster to subscribe to the local ZK's data.
>>
>>>
>>> -Todd
>>>
>>> -----Original Message-----
>>> From: Scott Carey [mailto:scott@richrelevance.com]
>>> Sent: Monday, July 20, 2009 11:00 AM
>>> To: zookeeper-user@hadoop.apache.org
>>> Subject: Re: Leader Elections
>>>
>>> Observers would be awesome, especially with a couple of
>>> enhancements / extensions:
>>>
>>> An option for the observers to enter a special state if the WAN link
>>> goes down to the "master" cluster. A read-only option would be
>>> great. However, allowing certain types of writes to continue on a
>>> limited basis would be highly valuable as well. An observer could
>>> "own" a special node and its subnodes. Only these subnodes would be
>>> writable by the observer when there was a session break to the
>>> master cluster, and the master cluster would take all the changes
>>> when the link is reestablished. Essentially, it is a portion of the
>>> hierarchy that is writable only by a specific observer, and
>>> read-only for others.
>>> The purpose of this would be for when the WAN link goes down to the
>>> "master" ZKs for certain types of use cases - status updates or
>>> other changes local to the observer that are strictly read-only
>>> outside the Observer's 'realm'.
>>>
>>>
>>> On 7/19/09 12:16 PM, "Henry Robinson" wrote:
>>>
>>> You can. See ZOOKEEPER-368 - at first glance it sounds like
>>> observers will be a good fit for your requirements.
>>>
>>> Do bear in mind that the patch on the jira is only for discussion
>>> purposes; I would not consider it currently fit for production use.
>>> I hope to put up a much better patch this week.
>>>
>>> Henry
>>>
>>> On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning wrote:
>>>
>>>> Can you submit updates via an observer?
>>>>
>>>> On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira wrote:
>>>>
>>>>> 2- Observers: you could have one computing center containing an
>>>>> ensemble and observers around the edge just learning committed
>>>>> values.
>>>>
>>>> --
>>>> Ted Dunning, CTO
>>>> DeepDyve
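
The registration-time path filter Flavio describes at the top of this thread, together with Henry's point that the Observer must still learn the latest zxid, could be sketched roughly as follows. This is only an illustration in Python under stated assumptions: the names `is_under`, `ObserverFeed`, and `offer` are hypothetical and are not part of ZooKeeper's API, and the sketch models only the server-side filtering decision, not the wire protocol or the sync code.

```python
def is_under(path, prefix):
    """Return True if `path` is the subscribed znode or one of its descendants.

    Matching is done on whole path components, so "/apps/web2" is NOT
    considered to be under "/apps/web".
    """
    prefix = prefix.rstrip("/")
    if prefix == "":
        return True  # a subscription to "/" covers the whole tree
    return path == prefix or path.startswith(prefix + "/")


class ObserverFeed:
    """Hypothetical per-observer state kept by the ensemble server."""

    def __init__(self, subscribed_prefix):
        # Path the observer appended to its initial registration message.
        self.prefix = subscribed_prefix
        # Highest zxid the observer has been told about.
        self.last_zxid = 0
        # Updates actually forwarded to the observer.
        self.delivered = []

    def offer(self, zxid, path, data):
        """Handle one committed update; return True if it was forwarded."""
        # Always advance the zxid, even when the update is filtered out,
        # so the observer knows the latest zxid without the full history.
        self.last_zxid = max(self.last_zxid, zxid)
        if is_under(path, self.prefix):
            self.delivered.append((zxid, path, data))
            return True
        return False
```

For example, an observer registered with `ObserverFeed("/apps/web")` would receive an update to `/apps/web/config` but not one to `/apps/web2/config` or `/other`, while `last_zxid` still advances for all three.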