zookeeper-user mailing list archives

From Maarten Koopmans <maar...@vrijheid.net>
Subject Re: node symlinks
Date Mon, 26 Jul 2010 20:44:51 GMT
Ted,

Thanks for thinking along with me. Your line of thought is what I 
originally had in mind, but I have some boundary conditions that I think 
make things subtly different. I am curious what you think.

First, I think your numbers are right. Even so, every multiple of that 
number could be handled with copies from there on. What makes things 
slightly different is the way I have organized the application: every 
user is loaded from a central (also ZK) cluster, and that includes the 
user's file-metadata "root" ZK cluster.

A filesystem in ZK is then always user-based, i.e. your filesystem 
structure /foo/bar equates to /user/foo/bar in your ZK data cluster, say 
ZK-I.

Now, with an average of 10M nodes in a cluster and one node equating to 
one file, the assumption is that 500 users can run on ZK-I (this averages 
to 20K files/user, which is quite a lot for off-site storage). However, 
in a way this is a "bet" - if a few users suddenly copy large data sets, 
you're in a tough place. Let's say this happens, and ZK-I hits the 85% 
utilization mark. At that point we start ZK-II as overflow, create user 
data space for users that upload new data, and "attach" ZK-II via a 
symlink to ZK-I (the attaching will have to be done by the same process 
that monitors load). ZK-I is in "add symlink only" mode now, and has 15% 
left to create collections that point to ZK-II, ZK-III, etc. (there will 
be a notion of the "current overflow cluster").

So, /user/foo/bar/more points to ZK-II /user/@more/ and can be retrieved 
via the client lib that just traverses the tree. Note that new users can 
be added to ZK-II as well, and the whole scheme can be repeated for 
ZK-III. Once you know the root cluster for a specific user, it's just 
traversal (and maybe memcache).
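
As a rough sketch in Scala (just to make this concrete; the zk:// target 
encoding in the symlink node's data and the resolver below are made up, 
not an existing ZooKeeper feature), the cluster-aware lookup could look 
like this:

import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}
import scala.annotation.tailrec

// Sketch only: assumes a "symlink" znode stores its target as
// "zk://<connect-string>/<path>" in its data. Neither that encoding nor
// this resolver is an existing ZooKeeper feature.
object SymlinkResolver {
  private val SymlinkPrefix = "zk://"

  private val noopWatcher = new Watcher {
    override def process(event: WatchedEvent): Unit = ()
  }

  // Follow symlinks until we reach the cluster that actually holds the
  // node, returning the client and the path to use on that cluster.
  @tailrec
  def resolve(zk: ZooKeeper, path: String, hops: Int = 0): (ZooKeeper, String) = {
    require(hops < 5, s"too many symlink hops while resolving $path")
    val data = new String(zk.getData(path, false, null), "UTF-8")
    if (data.startsWith(SymlinkPrefix)) {
      val rest    = data.stripPrefix(SymlinkPrefix)  // "host1:2181,host2:2181/user/@more"
      val slash   = rest.indexOf('/')
      val connect = rest.substring(0, slash)
      val target  = rest.substring(slash)
      // A real client lib would cache/pool these connections instead of reopening.
      resolve(new ZooKeeper(connect, 30000, noopWatcher), target, hops + 1)
    } else {
      (zk, path)
    }
  }
}

The same resolve step slots straight into the tail-recursive traversal: 
whenever a node turns out to be a symlink, the recursion simply continues 
on the other cluster's client.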

This can only be done as long as, on a user level, the data is partitioned 
into smaller sets, say 10-100K files, /and/ you know the root ZK store. In 
other words, the 5B nodes are partitioned. Also, the copy-on-new-cluster 
cost disappears in this scenario (bursts are handled better).

--Maarten

On 07/26/2010 06:52 PM, Ted Dunning wrote:
> So ZK is going to act like a file meta-data store and the number of files
> might scale to a very large number.
>
> For me, 5 billion files sounds like a large number and this seems to imply
> ZK storage of 50-500GB.  If you assume 8GB usable space per machine, a fully
> scaled system would require 6-60 ZK clusters.  If you start with 1 cluster
> and scale by a factor of four at each expansion step, this will require 4
> expansions.
>
> I think that the easy way is to simply hash your file names to pick a
> cluster.  You should have a central facility (ZK of course) that maintains a
> history of hash seeds that have been used for cluster configurations
> that still have live files.  The process for expansion would be:
>
> a) bring up the new clusters.
>
> b) add a new hash seed/number of clusters.  All new files will be created
> according to this new scheme.  Old files will still be in their old places.
>
> c) start a scan of all file meta-data records on the old clusters to move
> them to where they should live in the current hashing.  When this scan
> finishes, you can retire the old hash seed.  Since each ZK would only
> contain at most a few hundred million entries, you should be able to
> complete this scan in a day or so even if you are only scanning at a rate of
> a thousand entries per second.
>
> Since the scans of the old cluster might take quite a while and you might
> even have two expansions before a scan is done, finding a file will consist
> of probing current and old but still potentially active locations.  This is
> the cost of the move-after-expansion strategy, but it can be hard to build
> consistent systems without this old/new hash idea.  Normally I recommend
> micro-sharding to avoid one-by-one object motion, but that wouldn't really
> work with a ZK base.
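
[A rough Scala sketch of the seed-history routing described above - the 
HashEpoch record, the MurmurHash3 choice and the cluster numbering are 
made up for illustration, not an existing facility:]

import scala.util.hashing.MurmurHash3

// Sketch only: route a file path to a cluster using the newest hash epoch,
// and list every cluster that may still hold it while older epochs are
// being rehashed. The epoch list itself would live in the central ZK cluster.
final case class HashEpoch(seed: Int, clusterCount: Int)

object ClusterRouter {
  // Where new files are created: the newest epoch comes first in the list.
  def currentCluster(path: String, epochs: List[HashEpoch]): Int =
    clusterFor(path, epochs.head)

  // Where to probe on lookup: the current location plus every location
  // implied by an older seed whose rehashing scan has not finished yet.
  def candidateClusters(path: String, epochs: List[HashEpoch]): List[Int] =
    epochs.map(clusterFor(path, _)).distinct

  private def clusterFor(path: String, epoch: HashEpoch): Int =
    Math.floorMod(MurmurHash3.stringHash(path, epoch.seed), epoch.clusterCount)
}

[A lookup probes candidateClusters in order; once the scan for an old seed 
finishes, its epoch is dropped and the extra probes disappear.]
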
>
> A more conventional approach would be to use Voldemort or Cassandra.
>   Voldemort especially has some very nice expansion/resharding capabilities
> and is very fast.  It wouldn't necessarily give you the guarantees of ZK,
> but it is a pretty effective solution that saves you from having to implement
> the scaling of the storage layer.
>
> Also, the more you can store meta-data for multiple files in a single Znode,
> the better off you will be in terms of memory efficiency.
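
[For instance, batching per directory instead of per file; the FileMeta 
record and the tab-separated encoding below are invented, and note that 
znode data is capped at roughly 1 MB by default (jute.maxbuffer):]

// Sketch only: one znode per directory, whose data holds all of that
// directory's file records, instead of one znode per file.
final case class FileMeta(name: String, s3Key: String, size: Long)

object DirNode {
  // name<TAB>s3Key<TAB>size, one file per line, stored as a single znode's data.
  def encode(files: Seq[FileMeta]): Array[Byte] =
    files.map(f => s"${f.name}\t${f.s3Key}\t${f.size}").mkString("\n").getBytes("UTF-8")

  def decode(data: Array[Byte]): Seq[FileMeta] =
    new String(data, "UTF-8").split("\n").toSeq.filter(_.nonEmpty).map { line =>
      val Array(name, key, size) = line.split("\t")
      FileMeta(name, key, size.toLong)
    }
}
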
>
>
>
> On Mon, Jul 26, 2010 at 9:27 AM, Maarten Koopmans <maarten@vrijheid.net> wrote:
>
>>
>> Hi Mahadev,
>>
>> My use case is mapping a flat object store (like S3) to a filesystem and opening
>> it up via WebDAV. So ZooKeeper mirrors the filesystem (each node corresponds
>> to a collection or a file), and is used for locking and provides the pointer
>> to the actual data object in e.g. S3.
>>
>> A "symlink" could just be dialected in the ZK node - my tree traversal can
>> recurses and can be made cluster aware. That way, I don't need a special
>> central table.
>>
>> Does this clarify? The number of nodes might grow rapidly with more users, and
>> I need to be able to grow across users and filesystems.
>>
>> Best, Maarten
>>
>> On 07/26/2010 06:12 PM, Mahadev Konar wrote:
>>
>>> Hi Maarten,
>>>    Can you elaborate on your use case of ZooKeeper? We currently don't have
>>> any symlinks feature in zookeeper. The only way to do it for you would be a
>>> client side hash/lookup table that buckets data to different zookeeper
>>> servers.
>>>
>>> Or you could also store this hash/lookup table in one of the zookeeper
>>> clusters. This lookup table can then be cached on the client side after
>>> reading it once from zookeeper servers.
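
[A sketch of that cached lookup table, with an invented /routing/table 
path and a "bucket=connect-string" line format; the watch re-arms on every 
read so the cache refreshes when the table changes:]

import java.util.concurrent.atomic.AtomicReference
import org.apache.zookeeper.{WatchedEvent, Watcher, ZooKeeper}

// Sketch only: read the routing table once from the central ZK cluster,
// cache it client-side, and re-read it whenever the znode changes.
class CachedLookupTable(zk: ZooKeeper, path: String = "/routing/table") {
  private val cache = new AtomicReference[Map[String, String]](Map.empty)

  private val watcher = new Watcher {
    override def process(event: WatchedEvent): Unit = refresh()
  }

  def refresh(): Unit = {
    val raw = new String(zk.getData(path, watcher, null), "UTF-8")
    cache.set(
      raw.split("\n").toSeq.filter(_.contains("=")).map { line =>
        val Array(bucket, connect) = line.split("=", 2)
        bucket -> connect
      }.toMap
    )
  }

  def lookup(bucket: String): Option[String] = cache.get().get(bucket)

  refresh() // populate the cache on construction and arm the first watch
}
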
>>>
>>> Thanks
>>> mahadev
>>>
>>>
>>> On 7/24/10 2:39 PM, "Maarten Koopmans" <maarten@vrijheid.net> wrote:
>>>
>>>> Yes, I thought about Cassandra or Voldemort, but I need ZK's guarantees
>>>> as it will provide the file system hierarchy to a flat object store so I
>>>> need locking primitives and consistency. Doing that on top of Voldemort
>>>> will give me a scalable version of ZK, but just slower. Might as well
>>>> find a way to scale across ZK clusters.
>>>>
>>>> Also, I want to be able to add clusters as the number of nodes grows.
>>>> Note that the #nodes will grow with the #users of the system, so the
>>>> clusters can grow sequentially, hence the symlink idea.
>>>>
>>>> --Maarten
>>>>
>>>> On 07/24/2010 11:12 PM, Ted Dunning wrote:
>>>>
>>>>> Depending on your application, it might be good to simply hash the node
>>>>> name
>>>>> to decide which ZK cluster to put it on.
>>>>>
>>>>> Also, a scalable key value store like Voldemort or Cassandra might be
>>>>> more
>>>>> appropriate for your application.  Unless you need the hard-core
>>>>> guarantees
>>>>> of ZK, they can be better for large scale storage.
>>>>>
>>>>> On Sat, Jul 24, 2010 at 7:30 AM, Maarten Koopmans <maarten@vrijheid.net> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a number of nodes that will grow larger than one cluster can hold,
>>>>>> so I am looking for a way to efficiently stack clusters. One way is to
>>>>>> have a zookeeper node "symlink" to another cluster.
>>>>>>
>>>>>> Has anybody ever done that, and do you have some tips or alternative
>>>>>> approaches? Currently I use Scala, and traverse zookeeper trees by proper
>>>>>> tail recursion, so adapting the tail recursion to process "symlinks" would
>>>>>> be my approach.
>>>>>>
>>>>>> Best, Maarten
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>

