From Simon Felix ...@iru.ch>
Subject RE: Shared block storage via ZooKeeper
Date Wed, 13 Jul 2011 12:15:56 GMT
Thanks for the reply. I'll try to clarify my question a bit. I want to
simulate a single, fault-tolerant shared block storage device. This means
everything should be replicated and consistent. All the system manages is
(for example) one billion blocks, each containing exactly 4096 bytes. I do
not need any metadata per block or locking. There will be multiple nodes,
all reading and writing the data concurrently. If two nodes A and B write to
the same block concurrently I expect that all nodes have either version A or
version B of the block afterwards.
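
To make the expected semantics concrete, here is a minimal single-process
sketch. It is only a model of the contract, not a distributed implementation:
the class name BlockStore and its methods are hypothetical, and a lock stands
in for whatever mechanism would order writes across real nodes.

```python
import threading

BLOCK_SIZE = 4096  # bytes per block, as stated in the message


class BlockStore:
    """Single-process model of the shared block device (hypothetical interface)."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self._blocks = {}  # sparse map: unwritten blocks read back as zeros
        self._lock = threading.Lock()

    def write(self, index, data):
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index out of range")
        if len(data) != BLOCK_SIZE:
            raise ValueError("a block must be exactly 4096 bytes")
        # The whole block is replaced atomically, so a reader never
        # observes a mix of two concurrent writes.
        with self._lock:
            self._blocks[index] = bytes(data)

    def read(self, index):
        if not 0 <= index < self.num_blocks:
            raise IndexError("block index out of range")
        with self._lock:
            return self._blocks.get(index, bytes(BLOCK_SIZE))
```

If two threads write different contents to the same index, a later read
returns one of the two complete blocks, never a blend of both.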


I'm not sure which of the options is the easiest to implement and which will
give me the highest performance.


#2: Cassandra: Would you store the data in multiple rows? Columns? How much
data per column? I should probably ask the Cassandra people about this...

#3: BookKeeper: Every node writes data. I'd use BookKeeper as a
write-ahead log. Was BookKeeper built for that kind of workload?
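
A rough sketch of the write-ahead-log idea, under loud assumptions: OrderedLog
below is an in-memory stand-in for a replicated, totally ordered log and is
NOT the BookKeeper API, and Replica is a hypothetical name. How several
writing nodes would obtain a single ordering is left open here.

```python
BLOCK_SIZE = 4096  # bytes per block


class OrderedLog:
    """In-memory stand-in for a replicated, totally ordered log.

    This is an assumption for illustration, not the BookKeeper API.
    """

    def __init__(self):
        self._entries = []

    def append(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("a block must be exactly 4096 bytes")
        self._entries.append((index, bytes(data)))
        return len(self._entries) - 1  # position of the new entry

    def read_from(self, position):
        return self._entries[position:]


class Replica:
    """Applies log entries in log order to a local block map."""

    def __init__(self, log):
        self._log = log
        self._applied = 0  # number of log entries already applied
        self.blocks = {}

    def catch_up(self):
        for index, data in self._log.read_from(self._applied):
            self.blocks[index] = data  # later entries overwrite earlier ones
            self._applied += 1
```

Because every replica applies the same entries in the same order, two
conflicting writes to one block resolve to the same winner on all replicas
once they have caught up.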


Has anyone else done something similar? I couldn't find anything in the






From: Flavio Junqueira [mailto:fpj@yahoo-inc.com] 
Sent: Wednesday, July 13, 2011 14:01
To: user@zookeeper.apache.org
Subject: Re: Shared block storage via ZooKeeper


Hi Simon, it is not entirely clear to me what you need ZooKeeper for in this
case. Are the blocks replicated, and do you need to guarantee that updates
are consistent across replicas?


On your observations, I'm quite sure people will have an opinion, so here
are my thoughts, which might not be representative of the whole community:

1- You're right, we do not recommend using ZooKeeper directly as the data
store. ZooKeeper servers keep their state in memory.

2- Cassandra already provides replication. Are you trying to strengthen the
guarantees of Cassandra? I don't get it...

3- It sounds right that you could use BK as a journal, but it is not clear which
element is writing to the journal. Are you assuming a metadata manager such
as the namenode of HDFS?

4- I'm not sure what this option means. Are you proposing ZooKeeper to
manage the metadata of the file system? If so, I don't find it entirely
unrealistic, since metadata updates are supposed to be small and the
performance of ZooKeeper should be good enough for your case, but it might
be awkward to have your block storage clients talking directly to ZooKeeper.
Changes to metadata management would imply in this case rolling out a new
version of the client application instead of just having the changes
implemented on the service side.  




On Jul 13, 2011, at 12:02 PM, Simon Felix wrote:

Hello everyone

What is the best way to build a distributed, shared storage system on top of
ZooKeeper? I'm talking about block storage in the terabyte-range (i.e. store
billions of 4k blocks). Consistency and Availability are important, as is
throughput (both read & write). I need at least 50 MB/s with 3 nodes with
two regular SATA drives each for my application.
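
For scale, a quick back-of-the-envelope calculation of what these numbers
imply. The unit readings are assumptions: "4k" is taken as 4096 bytes,
"50 MB/s" as decimal megabytes, and one billion blocks as a lower bound for
"billions".

```python
BLOCK_SIZE = 4096               # bytes; "4k" read as 4096
TARGET_THROUGHPUT = 50 * 10**6  # bytes/s; "50 MB/s" read as decimal megabytes
NUM_BLOCKS = 10**9              # assumed lower bound for "billions" of blocks

# Block operations per second needed to sustain the target throughput.
blocks_per_second = TARGET_THROUGHPUT / BLOCK_SIZE

# Logical capacity before any replication overhead.
capacity_tb = NUM_BLOCKS * BLOCK_SIZE / 10**12

print(f"{blocks_per_second:.0f} block operations per second")  # 12207
print(f"{capacity_tb:.3f} TB of logical capacity")             # 4.096
```

So the target is roughly 12,000 block operations per second over about 4 TB
of logical data, before counting replicas.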

Some options I came up with:
1. Use ZooKeeper directly as a data store (Not recommended according to the
docs - and it really leads to abysmally bad performance, I tested that)
2. Use Cassandra as data store
3. Use BookKeeper as write-ahead log and implement my own underlying store
4. Use ZooKeeper to create my own (probably buggy...) data store

What would you recommend? Are there other options?



