From: Apache Wiki
To: core-commits@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Wed, 19 Nov 2008 16:24:01 -0000
Message-ID: <20081119162401.14790.87649@eos.apache.org>
Subject: [Hadoop Wiki] Update of "ZooKeeper/PartitionedZookeeper" by FlavioJunqueira

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by FlavioJunqueira:
http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZookeeper

New page:
== Partitioned ZooKeeper ==

=== 10,000 ft view ===

Our main goal is to make write throughput scale by partitioning ZooKeeper. The overall idea is to split a set of ZooKeeper servers into overlapping ensembles, and have each ensemble handle a portion of the ZooKeeper state. Because distinct ensembles handle different portions of the state, we end up relaxing the ordering guarantees that we have with plain ZooKeeper. To overcome this problem, we provide an abstraction that we call "containers". Containers are subtrees of znodes within which all update operations are ordered as with plain ZooKeeper.

=== Containers ===

Containers are subtrees of znodes, and the root of a container is not necessarily the root of a ZooKeeper tree. Upon the creation of a node, we state whether or not to create a new container for that node.

=== Changes to the API ===

The only change we envision is an extra parameter on create that tells whether or not to create a new container for the new node. All other operations should stay the same.

=== Internal changes ===

Internally, we will require more changes:

 1. '''Routing''': We need a mechanism to route requests to the correct ensemble. We can route in a distributed fashion, as with DHTs, or we can have one ZooKeeper ensemble responsible for mapping prefixes to ensembles;
 1. '''Containers''': A ZooKeeper server with this approach has to store and handle requests for a set of containers. Such a set may contain containers from different partitions. Handling different subsets for different partitions does not necessarily imply running multiple instances of the ZooKeeper server on a single machine, because containers are disjoint by definition and can be operated upon in parallel. We just have to make sure that we identify a container correctly when executing a ZooKeeper operation;
 1. '''Failure and Recovery''': Upon failure of a server, the immediate neighbor of that server should take over its position. It might be necessary to transfer new containers to the server that is taking over.
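
The routing, container and recovery ideas above can be illustrated with a small sketch. It is not ZooKeeper code: every name in it (`ring_ensembles`, `Router`, `containers_to_transfer`) is hypothetical. It also assumes several things the page leaves open: servers are arranged on a ring, ensembles are runs of consecutive servers, the tree root is itself a container, new containers are assigned to ensembles round-robin, and the "immediate neighbor" of a server is its successor on the ring. Routing here is the centralized variant, a table mapping path prefixes (container roots) to ensembles.

```python
# Hypothetical sketch only: none of these names exist in ZooKeeper.
from dataclasses import dataclass, field


def ring_ensembles(servers, size):
    """Build overlapping ensembles: each is `size` consecutive servers on the ring."""
    n = len(servers)
    return [tuple(servers[(i + j) % n] for j in range(size)) for i in range(n)]


@dataclass
class Router:
    ensembles: list                                  # list of tuples of server names
    containers: dict = field(default_factory=dict)   # container root -> ensemble index

    def __post_init__(self):
        # Assumption: the tree root is a container handled by ensemble 0.
        self.containers.setdefault("/", 0)

    def container_of(self, path):
        """Return the root of the container holding `path` (longest matching prefix)."""
        candidate = path
        while candidate not in self.containers:
            candidate = candidate.rsplit("/", 1)[0] or "/"
        return candidate

    def route(self, path):
        """Return the ensemble that must order updates to `path`."""
        return self.ensembles[self.containers[self.container_of(path)]]

    def create(self, path, new_container=False):
        """Model the extra `create` parameter: optionally root a new container at `path`."""
        if new_container:
            # Assumption: round-robin assignment of new containers to ensembles.
            self.containers[path] = len(self.containers) % len(self.ensembles)
        return self.route(path)


def containers_to_transfer(router, servers, failed):
    """List container roots the successor of `failed` does not already store."""
    neighbor = servers[(servers.index(failed) + 1) % len(servers)]
    return sorted(
        root
        for root, idx in router.containers.items()
        if failed in router.ensembles[idx] and neighbor not in router.ensembles[idx]
    )


if __name__ == "__main__":
    servers = ["s1", "s2", "s3", "s4", "s5"]
    router = Router(ring_ensembles(servers, 3))
    router.create("/apps", new_container=True)
    print(router.route("/apps/config/db"))   # same ensemble as /apps
    print(containers_to_transfer(router, servers, "s3"))
```

A node created without the flag stays in its parent's container, so its updates are ordered by the same ensemble as the container root. Because the ensembles overlap, the neighbor that takes over already holds some of the failed server's containers, and only the remainder has to be transferred.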