zookeeper-user mailing list archives

From Yang <teddyyyy...@gmail.com>
Subject Re: Shared block storage via ZooKepper
Date Wed, 13 Jul 2011 17:17:51 GMT
Actually, I was just thinking about this and tried to ask exactly the same
question.

Right now ZK is used to store small pieces of data such as shared config,
and for locking/coordination. But since it has a replicated data store, it
would be nice to use it to store large volumes of data directly.

In fact, from the "Paxos Made Live" paper, page 3:

"We devoted effort to designing clean interfaces separating the Paxos
framework, the database, and Chubby. We did this partly for clarity while
developing this system, but also with the intention of reusing the
replicated log layer in other applications. We anticipate future systems at
Google that seek fault-tolerance through replication. We believe that a
fault-tolerant log is a powerful primitive on which to build such systems."

Essentially, in the Google Paxos implementation, application code can simply
grab the latest committed log record and use it for whatever the application
needs. If ZooKeeper abstracted out the messaging protocol and provided the
committed transaction "stream" as the interface to applications, we could
potentially use it for many applications, including data storage. Note that
this is completely outside the current ZK data model (znodes, etc.): all we
would use from ZK is the underlying committed transaction stream. This part
of ZK could probably be provided as a library.
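
To make the idea concrete, here is a rough sketch in Python of what such an
interface might look like. None of these names exist in ZooKeeper today:
CommittedStream, Txn, BlockStore, propose() and subscribe() are all made up
for illustration, and the "replicated log" is just a local list standing in
for the real agreement protocol. The point is only that a block store could
be a small state machine that applies committed records in order.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

BLOCK_SIZE = 4096  # 4k blocks, as in Simon's question


@dataclass(frozen=True)
class Txn:
    """One committed record: write `data` to block `block_id`."""
    zxid: int       # position in the committed stream (hypothetical field)
    block_id: int
    data: bytes


class CommittedStream:
    """Hypothetical stand-in for the replicated log layer.

    A real implementation would deliver a record only after a quorum of
    servers agreed on it. Here a "commit" is just an append to a local list.
    """

    def __init__(self) -> None:
        self._log: List[Txn] = []
        self._subscribers: List[Callable[[Txn], None]] = []

    def subscribe(self, callback: Callable[[Txn], None]) -> None:
        # Replay everything committed so far, then deliver new records.
        for txn in self._log:
            callback(txn)
        self._subscribers.append(callback)

    def propose(self, block_id: int, data: bytes) -> int:
        # Assign the next position, "commit", and notify every subscriber.
        txn = Txn(zxid=len(self._log) + 1, block_id=block_id, data=data)
        self._log.append(txn)
        for callback in self._subscribers:
            callback(txn)
        return txn.zxid


class BlockStore:
    """Application state machine: applies committed writes strictly in order."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}
        self.last_applied = 0

    def apply(self, txn: Txn) -> None:
        if txn.zxid != self.last_applied + 1:
            raise ValueError("gap or reordering in the committed stream")
        if len(txn.data) != BLOCK_SIZE:
            raise ValueError("block must be exactly BLOCK_SIZE bytes")
        self.blocks[txn.block_id] = txn.data
        self.last_applied = txn.zxid


if __name__ == "__main__":
    stream = CommittedStream()
    replica_a = BlockStore()
    stream.subscribe(replica_a.apply)
    stream.propose(7, b"x" * BLOCK_SIZE)

    # A replica that joins late replays the log and converges to the same state.
    replica_b = BlockStore()
    stream.subscribe(replica_b.apply)
    stream.propose(7, b"y" * BLOCK_SIZE)
    print(replica_a.blocks == replica_b.blocks)
```

Because every replica applies the same records in the same order, they end
up with the same blocks, which is all the application needs from the log
layer. Persistence, snapshots and throughput are ignored in this sketch.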


On Wed, Jul 13, 2011 at 5:01 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:

> Hi Simon, It is not entirely clear to me what you need zookeeper for in
> this case. Are blocks replicated and you need to guarantee that the updates
> are consistent across replicas?
> On your observations, I'm quite sure people will have an opinion, so here
> are my thoughts, which might not be representative of the whole community:
> 1- You're right, we do not recommend using ZooKeeper directly as the
> data store. ZooKeeper servers keep their state in memory.
> 2- Cassandra already provides replication. Are you trying to strengthen the
> guarantees of Cassandra? I don't get it...
> 3- Sounds right that you could use BK as a journal, but it is not clear
> which element is writing to the journal. Are you assuming a metadata manager
> such as the namenode of HDFS?
> 4- I'm not sure what this option means. Are you proposing ZooKeeper to
> manage the metadata of the file system? If so, I don't find it entirely
> unrealistic, since metadata updates are supposed to be small and the
> performance of ZooKeeper should be good enough for your case, but it might
> be awkward to have your block storage clients talking directly to ZooKeeper.
> Changes to metadata management would imply in this case rolling out a new
> version of the client application instead of just having the changes
> implemented on the service side.
> -Flavio
> On Jul 13, 2011, at 12:02 PM, Simon Felix wrote:
> Hello everyone
> What is the best way to build a distributed, shared storage system on top
> of ZooKeeper? I'm talking about block storage in the terabyte-range (i.e.
> store billions of 4k blocks). Consistency and Availability are important,
> as is throughput (both read & write). I need at least 50 MB/s with 3 nodes
> with two regular SATA drives each for my application.
> Some options I came up with:
> 1. Use ZooKeeper directly as a data store (Not recommended according to the
> docs - and it really leads to abysmally bad performance, I tested that)
> 2. Use Cassandra as data store
> 3. Use BookKeeper as write-ahead log and implement my own underlying store
> 4. Use ZooKeeper to create my own (probably buggy...) data store
> What would you recommend? Are there other options?
> Cheers,
> Simon
> *flavio*
> *junqueira*
> research scientist
> fpj@yahoo-inc.com
> direct +34 93-183-8828
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300    fax (408) 349 3301
