zookeeper-user mailing list archives

From Scott Fines <scottfi...@gmail.com>
Subject Re: Shared block storage via ZooKepper
Date Wed, 13 Jul 2011 18:25:47 GMT
Cassandra and/or HBase would work pretty well for this, it sounds like,
though I'm not sure that HBase satisfies your hardware requirements.

Project Voldemort might also be a good option, though you'd suffer if you
tried to get groups of blocks at the same time.

If I were writing it, and had some information about a good grouping
policy, I would probably use Cassandra and store each block in its own
column, with a group of blocks sharing a row. Of course, you could also
store each block in a row with a single column, which would also work,
depending on your access patterns. If you went this route, you would
probably only use ZooKeeper for:

1. Transactional support (or for row locking, at least)
2. Cassandra node discovery (for automated discovery of scaled out machines)
3. Failure detection(?)
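
To make the grouping idea concrete, here is a minimal Python sketch of one possible policy: consecutive block numbers share a row, and each block is one column in that row. The group size of 256, the "blocks:" row-key prefix and the function names are assumptions made up for this example; nothing here is a Cassandra API, and a real policy would depend on your access patterns.

```python
from typing import List, Tuple

BLOCK_SIZE = 4096   # bytes per block, as stated in the thread
GROUP_SIZE = 256    # assumed number of blocks per row; tune to your workload


def block_location(block_id: int, group_size: int = GROUP_SIZE) -> Tuple[str, int]:
    """Map a block number to a hypothetical (row key, column name) pair."""
    if block_id < 0:
        raise ValueError("block_id must be non-negative")
    row, column = divmod(block_id, group_size)
    # Zero-padded so row keys sort in block order.
    return f"blocks:{row:012d}", column


def rows_for_range(first_block: int, count: int,
                   group_size: int = GROUP_SIZE) -> List[str]:
    """List the row keys touched when reading `count` consecutive blocks."""
    if count <= 0:
        raise ValueError("count must be positive")
    first_row = first_block // group_size
    last_row = (first_block + count - 1) // group_size
    return [f"blocks:{r:012d}" for r in range(first_row, last_row + 1)]
```

With group_size=1 this degenerates into the other layout mentioned above, one row per block with a single column. Larger groups mean a run of neighbouring blocks touches fewer rows.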

Since Cassandra doesn't necessarily have strong consistency guarantees,
ZooKeeper could also be used as an ordering provider to create
happens-before relationships.
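
A rough sketch of what "ordering provider" could mean in practice: every write is tagged with a strictly increasing sequence number, and each replica keeps only the highest-numbered write per block. In the Python below, SequenceProvider is an in-memory stand-in for whatever ZooKeeper would supply (for example the suffix of a sequential znode, which is an assumption on my part), and Replica is a hypothetical in-memory store, not Cassandra.

```python
import itertools
import threading
from typing import Dict, Optional, Tuple


class SequenceProvider:
    """In-memory stand-in for an external ordering service such as ZooKeeper."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next(self) -> int:
        # Strictly increasing, even when called from several threads.
        with self._lock:
            return next(self._counter)


class Replica:
    """Hypothetical replica that keeps the highest-sequenced write per block."""

    def __init__(self) -> None:
        self.blocks: Dict[int, Tuple[int, bytes]] = {}

    def apply(self, block_id: int, seq: int, data: bytes) -> bool:
        current = self.blocks.get(block_id)
        if current is not None and current[0] >= seq:
            return False  # stale or duplicate write, ignore it
        self.blocks[block_id] = (seq, data)
        return True

    def read(self, block_id: int) -> Optional[bytes]:
        entry = self.blocks.get(block_id)
        return None if entry is None else entry[1]


# Two nodes write the same block; the replicas see the writes in different orders.
order = SequenceProvider()
write_a = (7, order.next(), b"version A")
write_b = (7, order.next(), b"version B")

r1, r2 = Replica(), Replica()
r1.apply(*write_a)
r1.apply(*write_b)
r2.apply(*write_b)
r2.apply(*write_a)  # arrives late and is rejected
```

Both replicas end up holding version B, so every node sees either A or B for the block and never a mix, whatever the delivery order. The price is one round trip to the ordering service per write.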

It all just depends on what you're actually trying to do, I think.

Scott Fines

On Wed, Jul 13, 2011 at 1:13 PM, Simon Felix <de@iru.ch> wrote:

> Thanks for the suggestion but I guess I cannot use MapR for my purpose. I’m
> working on a non-commercial hobby project that one day I might make
> commercial. I believe what I want to use/build is simpler than a distributed
> file system because I don’t have to care about:
> - Metadata
> - Locking
> - Hierarchies
> - Access rights
> - Lookups
> So if anyone knows of a free, appropriately licensed alternative I’d be happy
> to use that.
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Wednesday, 13 July 2011 18:52
> To: user@zookeeper.apache.org
> Subject: Re: Shared block storage via ZooKepper
> Simon,
> What you are describing is (roughly) a general read-write distributed and
> replicated file system.  This is a hard problem if you want high
> performance, absolute consistency and significant amounts of failure
> tolerance.  Building such a system from scratch is a difficult proposition.
> Frankly, it also sounds just like the filesystem component of MapR
> (conflict alert, I work for MapR Technologies).  You may have additional
> constraints on what you are looking for, but to meet the requirements that
> you have already stated, you should take a look at our offering.  I can
> imagine scenarios where this wouldn't be satisfactory, particularly if this
> is a homework assignment, but if you are simply trying to solve a real
> engineering problem, it should do very well.  I don't want to hijack this
> list with non-Zookeeper discussion so feel free to contact me directly for
> more pointers.
> Ohh... I should mention MapR uses Zookeeper prominently and is glad to do
> so.  The strictness and durability of ZK are ideal as the last resort
> determinant of coordination.  In many areas of our system, the ZK trade-offs
> are not appropriate, especially where speed is critical, but that isn't what
> ZK was designed to do.  Using ZK appropriately gives extremely good results.
> On Wed, Jul 13, 2011 at 5:15 AM, Simon Felix <de@iru.ch> wrote:
> Thanks for the reply. I’ll try to clarify my question a bit. I want to
> simulate a single, fault-tolerant shared block storage device. This means
> everything should be replicated and consistent. All that system manages is
> (for example) one billion blocks, each containing exactly 4096 bytes. I do
> not need any metadata per block or locking. There will be multiple nodes,
> all reading and writing the data concurrently. If two nodes A and B write to
> the same block concurrently I expect that all nodes have either version A or
> version B of the block afterwards.
> I’m not sure which of the options is the easiest to implement and which will
> give me the highest performance.
> #2: Cassandra: Would you store the data in multiple rows? Columns? How much
> data per column? I should probably ask the Cassandra people about this...
> #3: BookKeeper: Every node is writing to the data. I’d use BookKeeper as
> write-ahead log. Was BookKeeper built for that kind of workload?
> Has anyone else done something similar? I couldn’t find anything in the
> archives...
> Simon
> From: Flavio Junqueira [mailto:fpj@yahoo-inc.com]
> Sent: Wednesday, 13 July 2011 14:01
> To: user@zookeeper.apache.org
> Subject: Re: Shared block storage via ZooKepper
> Hi Simon, It is not entirely clear to me what you need zookeeper for in
> this case. Are blocks replicated and you need to guarantee that the updates
> are consistent across replicas?
> On your observations, I'm quite sure people will have an opinion, so here
> are my thoughts, which might not be representative of the whole community:
> 1- You're right, we do not recommend using ZooKeeper directly as the
> data store. ZooKeeper servers keep their state in memory.
> 2- Cassandra already provides replication. Are you trying to strengthen the
> guarantees of Cassandra? I don't get it...
> 3- Sounds right that you could use BK as a journal, but it is not clear
> which element is writing to the journal. Are you assuming a metadata manager
> such as the namenode of HDFS?
> 4- I'm not sure what this option means. Are you proposing ZooKeeper to
> manage the metadata of the file system? If so, I don't find it entirely
> unrealistic, since metadata updates are supposed to be small and the
> performance of ZooKeeper should be good enough for your case, but it might
> be awkward to have your block storage clients talking directly to ZooKeeper.
> Changes to metadata management would imply in this case rolling out a new
> version of the client application instead of just having the changes
> implemented on the service side.
> -Flavio
> On Jul 13, 2011, at 12:02 PM, Simon Felix wrote:
> Hello everyone
> What is the best way to build a distributed, shared storage system on top
> of ZooKeeper? I'm talking about block storage in the terabyte-range (i.e.
> store billions of 4k blocks). Consistency and Availability are important,
> as is throughput (both read & write). I need at least 50 MB/s with 3 nodes
> with two regular SATA drives each for my application.
> Some options I came up with:
> 1. Use ZooKeeper directly as a data store (Not recommended according to the
> docs - and it really leads to abysmally bad performance, I tested that)
> 2. Use Cassandra as data store
> 3. Use BookKeeper as write-ahead log and implement my own underlying store
> 4. Use ZooKeeper to create my own (probably buggy...) data store
> What would you recommend? Are there other options?
> Cheers,
> Simon
> flavio
> junqueira
> research scientist
> fpj@yahoo-inc.com
> direct +34 93-183-8828
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300    fax (408) 349 3301
