From: Ted Dunning
Date: Mon, 26 Jul 2010 09:52:56 -0700
Subject: Re: node symlinks
To: zookeeper-user@hadoop.apache.org
In-Reply-To: <4C4DB778.7040401@vrijheid.net>

So ZK is going to act like a file meta-data store, and the number of files
might scale to a very large number. For me, 5 billion files sounds like a
large number, and this seems to imply ZK storage of 50-500GB. If you assume
8GB of usable space per machine, a fully scaled system would require 6-60
ZK clusters. If you start with 1 cluster and scale by a factor of four at
each expansion step, this will require 4 expansions.

I think that the easy way is to simply hash your file names to pick a
cluster. You should have a central facility (ZK, of course) that maintains
a history of the hash seeds that have been used for cluster configurations
that still have live files.
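To make that concrete, here is a minimal sketch of the routing side of the
idea (the names and the choice of MurmurHash3 are just illustrative; the
generation list itself is what you would keep in the central ZK cluster):

    import scala.util.hashing.MurmurHash3

    // One entry per hashing "generation": the seed that was used and the
    // number of clusters that existed when the generation was introduced.
    case class HashGeneration(seed: Int, numClusters: Int)

    // generations is ordered newest-first, so head is the current scheme.
    class ClusterRouter(generations: List[HashGeneration]) {

      private def clusterFor(path: String, gen: HashGeneration): Int = {
        val h = MurmurHash3.stringHash(path, gen.seed)
        ((h % gen.numClusters) + gen.numClusters) % gen.numClusters
      }

      // New files always go where the current generation says.
      def clusterForCreate(path: String): Int =
        clusterFor(path, generations.head)

      // Lookups probe the current generation first, then any older
      // generations whose files have not all been migrated yet.
      def candidateClusters(path: String): List[Int] =
        generations.map(clusterFor(path, _)).distinct
    }

Retiring a seed is then just a matter of dropping its generation from the
list once the rescan described below has finished.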
The process for expansion would be:

a) Bring up the new clusters.

b) Add a new hash seed / number of clusters. All new files will be created
according to this new scheme. Old files will still be in their old places.

c) Start a scan of all file meta-data records on the old clusters to move
them to where they should live under the current hashing. When this scan
finishes, you can retire the old hash seed. Since each ZK cluster would
contain at most a few hundred million entries, you should be able to
complete this scan in a day or so, even if you are only scanning at a rate
of a thousand entries per second.

Since the scans of the old clusters might take quite a while, and you might
even have two expansions in flight before a scan is done, finding a file
will consist of probing the current and the old, but still potentially
active, locations. This is the cost of the move-after-expansion strategy,
but it can be hard to build consistent systems without this old/new hash
idea. Normally I recommend micro-sharding to avoid one-by-one object
motion, but that wouldn't really work with a ZK base.

A more conventional approach would be to use Voldemort or Cassandra.
Voldemort especially has some very nice expansion/resharding capabilities
and is very fast. It wouldn't necessarily give you the guarantees of ZK,
but it is a pretty effective solution that saves you from having to
implement the scaling of the storage layer yourself.

Also, the more meta-data for multiple files you can pack into a single
znode, the better off you will be in terms of memory efficiency.
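As a rough illustration of that packing idea (the bucket layout under /meta
and the name=pointer line encoding are made up for the example; any compact
serialization would do):

    import java.nio.charset.StandardCharsets.UTF_8
    import org.apache.zookeeper.ZooKeeper
    import org.apache.zookeeper.data.Stat

    // Pack many file entries into one bucket znode instead of using one
    // znode per file. Assumes the bucket znodes under /meta already exist.
    class PackedBuckets(zk: ZooKeeper, buckets: Int) {

      private def bucketPath(file: String): String =
        f"/meta/bucket-${(file.hashCode & 0x7fffffff) % buckets}%04d"

      private def decode(data: Array[Byte]): Map[String, String] =
        new String(data, UTF_8).split("\n").filter(_.nonEmpty).map { line =>
          val Array(k, v) = line.split("=", 2)
          k -> v
        }.toMap

      private def encode(entries: Map[String, String]): Array[Byte] =
        entries.map { case (k, v) => s"$k=$v" }.mkString("\n").getBytes(UTF_8)

      // Look up the pointer (e.g. the S3 object id) for one file.
      def lookup(file: String): Option[String] =
        decode(zk.getData(bucketPath(file), false, new Stat())).get(file)

      // Add or update one file's pointer. setData is versioned, so
      // concurrent writers to the same bucket need a retry loop around this.
      def put(file: String, pointer: String): Unit = {
        val path = bucketPath(file)
        val stat = new Stat()
        val entries = decode(zk.getData(path, false, stat))
        zk.setData(path, encode(entries + (file -> pointer)), stat.getVersion)
      }
    }

With a few thousand entries per bucket you pay the per-znode overhead once
per bucket rather than once per file, at the cost of rewriting the whole
bucket on every update.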
On Mon, Jul 26, 2010 at 9:27 AM, Maarten Koopmans wrote:

> Hi Mahadev,
>
> My use case is mapping a flat object store (like S3) to a filesystem and
> opening it up via WebDAV. So ZooKeeper mirrors the filesystem (each node
> corresponds to a collection or a file), is used for locking, and provides
> the pointer to the actual data object in e.g. S3.
>
> A "symlink" could just be dialected into the ZK node - my tree traversal
> recurses and can be made cluster-aware. That way, I don't need a special
> central table.
>
> Does this clarify? The number of nodes might grow rapidly with more users,
> and I need to grow between users and filesystems.
>
> Best, Maarten
>
> On 07/26/2010 06:12 PM, Mahadev Konar wrote:
>
>> Hi Maarten,
>> Can you elaborate on your use case of ZooKeeper? We currently don't have
>> any symlinks feature in ZooKeeper. The only way to do it for you would be
>> a client-side hash/lookup table that buckets data to different ZooKeeper
>> servers.
>>
>> Or you could also store this hash/lookup table in one of the ZooKeeper
>> clusters. This lookup table can then be cached on the client side after
>> reading it once from the ZooKeeper servers.
>>
>> Thanks
>> mahadev
>>
>> On 7/24/10 2:39 PM, "Maarten Koopmans" wrote:
>>
>>> Yes, I thought about Cassandra or Voldemort, but I need ZK's guarantees,
>>> as it will provide the file system hierarchy to a flat object store, so
>>> I need locking primitives and consistency. Doing that on top of
>>> Voldemort will give me a scalable version of ZK, but just slower. Might
>>> as well find a way to scale across ZK clusters.
>>>
>>> Also, I want to be able to add clusters as the number of nodes grows.
>>> Note that the number of nodes will grow with the number of users of the
>>> system, so the clusters can grow sequentially, hence the symlink idea.
>>>
>>> --Maarten
>>>
>>> On 07/24/2010 11:12 PM, Ted Dunning wrote:
>>>
>>>> Depending on your application, it might be good to simply hash the
>>>> node name to decide which ZK cluster to put it on.
>>>>
>>>> Also, a scalable key-value store like Voldemort or Cassandra might be
>>>> more appropriate for your application. Unless you need the hard-core
>>>> guarantees of ZK, they can be better for large-scale storage.
>>>>
>>>> On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have a number of nodes that will grow larger than one cluster can
>>>>> hold, so I am looking for a way to efficiently stack clusters. One
>>>>> way is to have a ZooKeeper node "symlink" to another cluster.
>>>>>
>>>>> Has anybody ever done that, and does anyone have tips or alternative
>>>>> approaches? Currently I use Scala and traverse ZooKeeper trees with
>>>>> proper tail recursion, so adapting the tail recursion to process
>>>>> "symlinks" would be my approach.
>>>>>
>>>>> Best, Maarten
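For what it's worth, a minimal sketch of the tail-recursive,
symlink-following walk Maarten describes could look like the following. The
"symlink:<cluster>:<path>" data convention is invented for the example
(ZooKeeper has no built-in symlinks), and connectFor is assumed to map a
cluster name to an already-open ZooKeeper handle:

    import java.nio.charset.StandardCharsets.UTF_8
    import scala.annotation.tailrec
    import scala.jdk.CollectionConverters._
    import org.apache.zookeeper.ZooKeeper

    object SymlinkWalk {
      private val Prefix = "symlink:"

      // List the children of a node, hopping to another cluster whenever
      // the node's data marks it as a "symlink".
      @tailrec
      def listChildren(zk: ZooKeeper, path: String,
                       connectFor: String => ZooKeeper): List[String] = {
        val raw = zk.getData(path, false, null)
        val data = if (raw == null) "" else new String(raw, UTF_8)
        if (data.startsWith(Prefix)) {
          val Array(_, cluster, target) = data.split(":", 3)
          listChildren(connectFor(cluster), target, connectFor)
        } else {
          zk.getChildren(path, false).asScala.toList
        }
      }
    }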