From: Scott Carey
To: "zookeeper-user@hadoop.apache.org"
Date: Mon, 20 Jul 2009 14:21:19 -0700
Subject: Re: Leader Elections

Todd has put it much more eloquently. Comments below:

On 7/20/09 11:50 AM, "Todd Greenwood" wrote:
> Flavio, Ted, Henry, Scott, this would work perfectly well for my use
> case provided:
>
> SINGLE ENSEMBLE:
> GROUP A : ZK Servers w/ read/write AND Leader Elections
> GROUP B : ZK Servers w/ read/write W/O Leader Elections
>
> So, we can craft this via Observers and Hierarchical Quorum groups?
> Great. Problem solved.
>
> When will this be production ready? :o)
>
> --------------------
>
> Scott brought up a multi-feature that is very interesting for me.
> Namely:
>
> 1. Offline ZK servers that sync & merge on reconnect
>
> The offline servers seem conceptually simple; it's kind of like a
> messaging system. However, the merge and resolve step when two servers
> reconnect might be challenging. Cool idea though.

Yes, this is very useful for WAN use cases. I've already done something
like it with a hack:

Ensemble A "Master/Central"

"Remote Proxy" N -- embeds its own ZK and runs two clients. One client
connects to Ensemble A and watches a partial sub-graph, propagating that
into its local embedded ZK server. This subgraph is read-only for those
that access the Proxy. A second client accesses the local ZK server and
monitors a different subgraph, which it propagates to the Master
ensemble. This subgraph is writable by clients accessing the Proxy, and
on the Master ensemble it is only written to by this Proxy.

The above is all application enforced. There are constraints on what
sort of things can be built with this, but for the subset of use cases I
need over WAN, it's more than enough.

> 2. Partial memory graph subscriptions
>
> The second idea is partial memory graph subscriptions. This would
> enable virtual ensembles to interact on the same physical ensemble.
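To make the shape of that hack concrete, here is a toy, in-memory sketch
of the one-way copy each proxy client performs. The real thing would use
ZooKeeper watches against live ensembles; the dicts, paths, and function
name below are all illustrative, not from any real API:

```python
# Toy model of one-way subgraph propagation: each "ensemble" is a plain
# dict of znode path -> data, so only the copy logic is shown. In the
# real hack, a ZK client watches src_root on one ensemble and re-creates
# the nodes under dst_root on the other.

def propagate(src_tree, dst_tree, src_root, dst_root):
    """Mirror every node at or under src_root into dst_tree, rebased
    under dst_root. One-way: dst_tree is a read-only mirror of src."""
    for path, data in src_tree.items():
        if path == src_root or path.startswith(src_root + "/"):
            suffix = path[len(src_root):]
            dst_tree[dst_root + suffix] = data
    return dst_tree

# Master ensemble exposes /config for remote sites; the proxy mirrors it
# into its embedded server under /master-mirror/config. Nodes outside
# the watched subgraph (/private) are not replicated.
master = {"/config": b"v1", "/config/db": b"host=a", "/private": b"x"}
proxy = propagate(master, {}, "/config", "/master-mirror/config")
```

A second instance of the same loop, pointed the other way at a different
subgraph, gives the writable direction.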
> For my use case, this would prevent unnecessary cross talk between
> nodes on a WAN, allowing me to define the subsets of the memory graph
> that need to be replicated, and to whom. This would be a huge
> scalability win for WAN use cases.

Yes, a more general partial graph subscription / ownership framework
would allow for not just better WAN scalability but also (and more
critically IMO) higher reliability. Often, some large subset of
application functionality is local to one network, and a minority is
global and in need of WAN communication. In this case, when the WAN
breaks, one wishes that local functionality to continue to function, and
only those parts truly dependent on external events to be interrupted.
Currently one has to have separate ensembles to partition data and
clunky 'bridge' code to intercommunicate.

It would certainly be more natural if two ZK ensembles could register
with each other, in a 'partial sub-graph publish/subscribe' framework.
It could almost be like file system mounting. To subscribe:

  subscribe otherEnsemble:port/path/to/otherstuff /localpath/to/mount/into

Publishing is the same thing -- think of it as a request for a remote ZK
cluster to subscribe to the local ZK's data.

> -Todd
>
> -----Original Message-----
> From: Scott Carey [mailto:scott@richrelevance.com]
> Sent: Monday, July 20, 2009 11:00 AM
> To: zookeeper-user@hadoop.apache.org
> Subject: Re: Leader Elections
>
> Observers would be awesome, especially with a couple of enhancements /
> extensions:
>
> An option for the observers to enter a special state if the WAN link
> goes down to the "master" cluster. A read-only option would be great.
> However, allowing certain types of writes to continue on a limited
> basis would be highly valuable as well. An observer could "own" a
> special node and its subnodes.
> Only these subnodes would be writable by the observer when there was a
> session break to the master cluster, and the master cluster would take
> all the changes when the link is reestablished. Essentially, it is a
> portion of the hierarchy that is writable only by a specific observer,
> and read-only for others.
> The purpose of this would be for when the WAN link goes down to the
> "master" ZKs for certain types of use cases -- status updates or other
> changes local to the observer that are strictly read-only outside the
> Observer's 'realm'.

On 7/19/09 12:16 PM, "Henry Robinson" wrote:

> You can. See ZOOKEEPER-368 - at first glance it sounds like observers
> will be a good fit for your requirements.
>
> Do bear in mind that the patch on the jira is only for discussion
> purposes; I would not consider it currently fit for production use. I
> hope to put up a much better patch this week.
>
> Henry
>
> On Sat, Jul 18, 2009 at 7:38 PM, Ted Dunning wrote:
>
>> Can you submit updates via an observer?
>>
>> On Sat, Jul 18, 2009 at 6:38 AM, Flavio Junqueira wrote:
>>
>>> 2- Observers: you could have one computing center containing an
>>> ensemble and observers around the edge just learning committed
>>> values.
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
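For what it's worth, the observer-"realm" write rule quoted above is
simple enough to state as code. A rough sketch, with made-up function
and argument names (nothing here is a real ZooKeeper API; it is just the
application-level check an observer would apply to each write while
partitioned):

```python
# Write rule for an observer's owned "realm": while the WAN link to the
# master cluster is up, writes go through normally (the master decides).
# While the link is down, only paths inside this observer's own realm
# stay writable; everything else is read-only until reconnect, when the
# master takes the accumulated realm changes.

def write_allowed(path, observer_realm, link_up):
    if link_up:
        return True  # normal operation: the master cluster arbitrates
    return path == observer_realm or path.startswith(observer_realm + "/")

# Observer owns /status/site-west; the WAN link is down.
assert write_allowed("/status/site-west/node1", "/status/site-west", False)
assert not write_allowed("/config/global", "/status/site-west", False)
```

Because each realm has exactly one writer while partitioned, the
reconnect-time merge is a plain one-way copy rather than a conflict
resolution problem.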