From: Ted Dunning
Date: Mon, 26 Jul 2010 09:52:56 -0700
Subject: Re: node symlinks
To: zookeeper-user@hadoop.apache.org
In-Reply-To: <4C4DB778.7040401@vrijheid.net>

So ZK is going to act like a file meta-data store, and the number of files
might scale to a very large number. For me, 5 billion files sounds like a
large number, and this seems to imply ZK storage of 50-500GB. If you assume
8GB of usable space per machine, a fully scaled system would require 6-60
ZK clusters. If you start with 1 cluster and scale by a factor of four at
each expansion step, this will require 4 expansions.

I think that the easy way is to simply hash your file names to pick a
cluster. You should have a central facility (ZK, of course) that maintains
a history of the hash seeds that have been used for cluster configurations
that still have live files.
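To make that concrete, here is a minimal sketch of the routing side of the
idea (the names and the choice of MurmurHash3 are just illustrative; the
generation list itself is what you would keep in the central ZK cluster):

    import scala.util.hashing.MurmurHash3

    // One entry per hashing "generation": the seed that was used and the
    // number of clusters that existed when the generation was introduced.
    case class HashGeneration(seed: Int, numClusters: Int)

    // generations is ordered newest-first, so head is the current scheme.
    class ClusterRouter(generations: List[HashGeneration]) {

      private def clusterFor(path: String, gen: HashGeneration): Int = {
        val h = MurmurHash3.stringHash(path, gen.seed)
        ((h % gen.numClusters) + gen.numClusters) % gen.numClusters
      }

      // New files always go where the current generation says.
      def clusterForCreate(path: String): Int =
        clusterFor(path, generations.head)

      // Lookups probe the current generation first, then any older
      // generations whose files have not all been migrated yet.
      def candidateClusters(path: String): List[Int] =
        generations.map(clusterFor(path, _)).distinct
    }

Retiring a seed is then just a matter of dropping its generation from the
list once the rescan described below has finished.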
The process for expansion would be:

a) Bring up the new clusters.

b) Add a new hash seed / number of clusters. All new files will be created
according to this new scheme. Old files will still be in their old places.

c) Start a scan of all file meta-data records on the old clusters to move
them to where they should live under the current hashing. When this scan
finishes, you can retire the old hash seed. Since each ZK cluster would
contain at most a few hundred million entries, you should be able to
complete this scan in a day or so, even if you are only scanning at a rate
of a thousand entries per second.

Since the scans of the old clusters might take quite a while, and you might
even have two expansions in flight before a scan is done, finding a file
will consist of probing the current and the old, but still potentially
active, locations. This is the cost of the move-after-expansion strategy,
but it can be hard to build consistent systems without this old/new hash
idea. Normally I recommend micro-sharding to avoid one-by-one object
motion, but that wouldn't really work with a ZK base.

A more conventional approach would be to use Voldemort or Cassandra.
Voldemort especially has some very nice expansion/resharding capabilities
and is very fast. It wouldn't necessarily give you the guarantees of ZK,
but it is a pretty effective solution that saves you from having to
implement the scaling of the storage layer yourself.

Also, the more meta-data for multiple files you can pack into a single
znode, the better off you will be in terms of memory efficiency.
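As a rough illustration of that packing idea (the bucket layout under /meta
and the name=pointer line encoding are made up for the example; any compact
serialization would do):

    import java.nio.charset.StandardCharsets.UTF_8
    import org.apache.zookeeper.ZooKeeper
    import org.apache.zookeeper.data.Stat

    // Pack many file entries into one bucket znode instead of using one
    // znode per file. Assumes the bucket znodes under /meta already exist.
    class PackedBuckets(zk: ZooKeeper, buckets: Int) {

      private def bucketPath(file: String): String =
        f"/meta/bucket-${(file.hashCode & 0x7fffffff) % buckets}%04d"

      private def decode(data: Array[Byte]): Map[String, String] =
        new String(data, UTF_8).split("\n").filter(_.nonEmpty).map { line =>
          val Array(k, v) = line.split("=", 2)
          k -> v
        }.toMap

      private def encode(entries: Map[String, String]): Array[Byte] =
        entries.map { case (k, v) => s"$k=$v" }.mkString("\n").getBytes(UTF_8)

      // Look up the pointer (e.g. the S3 object id) for one file.
      def lookup(file: String): Option[String] =
        decode(zk.getData(bucketPath(file), false, new Stat())).get(file)

      // Add or update one file's pointer. setData is versioned, so
      // concurrent writers to the same bucket need a retry loop around this.
      def put(file: String, pointer: String): Unit = {
        val path = bucketPath(file)
        val stat = new Stat()
        val entries = decode(zk.getData(path, false, stat))
        zk.setData(path, encode(entries + (file -> pointer)), stat.getVersion)
      }
    }

With a few thousand entries per bucket you pay the per-znode overhead once
per bucket rather than once per file, at the cost of rewriting the whole
bucket on every update.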
On Mon, Jul 26, 2010 at 9:27 AM, Maarten Koopmans wrote:

> Hi Mahadev,
>
> My use case is mapping a flat object store (like S3) to a filesystem and
> opening it up via WebDAV. So ZooKeeper mirrors the filesystem (each node
> corresponds to a collection or a file), is used for locking, and provides
> the pointer to the actual data object in e.g. S3.
>
> A "symlink" could just be dialected into the ZK node - my tree traversal
> recurses and can be made cluster-aware. That way, I don't need a special
> central table.
>
> Does this clarify? The number of nodes might grow rapidly with more users,
> and I need to grow between users and filesystems.
>
> Best, Maarten
>
> On 07/26/2010 06:12 PM, Mahadev Konar wrote:
>
>> Hi Maarten,
>> Can you elaborate on your use case of ZooKeeper? We currently don't have
>> any symlinks feature in ZooKeeper. The only way to do it for you would be
>> a client-side hash/lookup table that buckets data to different ZooKeeper
>> servers.
>>
>> Or you could also store this hash/lookup table in one of the ZooKeeper
>> clusters. This lookup table can then be cached on the client side after
>> reading it once from the ZooKeeper servers.
>>
>> Thanks
>> mahadev
>>
>> On 7/24/10 2:39 PM, "Maarten Koopmans" wrote:
>>
>>> Yes, I thought about Cassandra or Voldemort, but I need ZK's guarantees,
>>> as it will provide the file system hierarchy to a flat object store, so
>>> I need locking primitives and consistency. Doing that on top of
>>> Voldemort will give me a scalable version of ZK, but just slower. Might
>>> as well find a way to scale across ZK clusters.
>>>
>>> Also, I want to be able to add clusters as the number of nodes grows.
>>> Note that the number of nodes will grow with the number of users of the
>>> system, so the clusters can grow sequentially, hence the symlink idea.
>>>
>>> --Maarten
>>>
>>> On 07/24/2010 11:12 PM, Ted Dunning wrote:
>>>
>>>> Depending on your application, it might be good to simply hash the
>>>> node name to decide which ZK cluster to put it on.
>>>>
>>>> Also, a scalable key-value store like Voldemort or Cassandra might be
>>>> more appropriate for your application. Unless you need the hard-core
>>>> guarantees of ZK, they can be better for large-scale storage.
>>>>
>>>> On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a number of nodes that will grow larger than one cluster can
>>>>> hold, so I am looking for a way to efficiently stack clusters. One
>>>>> way is to have a ZooKeeper node "symlink" to another cluster.
>>>>>
>>>>> Has anybody ever done that, and does anyone have tips or alternative
>>>>> approaches? Currently I use Scala and traverse ZooKeeper trees with
>>>>> proper tail recursion, so adapting the tail recursion to process
>>>>> "symlinks" would be my approach.
>>>>>
>>>>> Best, Maarten
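For what it's worth, a minimal sketch of the tail-recursive,
symlink-following walk Maarten describes could look like the following. The
"symlink:<cluster>:<path>" data convention is invented for the example
(ZooKeeper has no built-in symlinks), and connectFor is assumed to map a
cluster name to an already-open ZooKeeper handle:

    import java.nio.charset.StandardCharsets.UTF_8
    import scala.annotation.tailrec
    import scala.jdk.CollectionConverters._
    import org.apache.zookeeper.ZooKeeper

    object SymlinkWalk {
      private val Prefix = "symlink:"

      // List the children of a node, hopping to another cluster whenever
      // the node's data marks it as a "symlink".
      @tailrec
      def listChildren(zk: ZooKeeper, path: String,
                       connectFor: String => ZooKeeper): List[String] = {
        val raw = zk.getData(path, false, null)
        val data = if (raw == null) "" else new String(raw, UTF_8)
        if (data.startsWith(Prefix)) {
          val Array(_, cluster, target) = data.split(":", 3)
          listChildren(connectFor(cluster), target, connectFor)
        } else {
          zk.getChildren(path, false).asScala.toList
        }
      }
    }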