zookeeper-user mailing list archives

From Tsuyoshi OZAWA <ozawa.tsuyo...@gmail.com>
Subject Re: [announce] Accord: A high-performance coordination service for write-intensive workloads
Date Tue, 27 Sep 2011 02:19:54 GMT
On Mon, Sep 26, 2011 at 6:49 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
> On Sep 25, 2011, at 9:02 AM, OZAWA Tsuyoshi wrote:
>
>> (2011/09/25 6:43), Flavio Junqueira wrote:
>>>
>>> Thanks for sending this reference to the list, it sounds very
>>> interesting. I have a few questions and comments, if you don't mind:
>>>
>>> 1- I was wondering if you can give more detail on the setup you used to
>>> generate the numbers you show in the graphs on your Accord page. The
>>> ZooKeeper values are way too low, and I suspect that you're using a
>>> single hard drive. It could be because you expect to use a single hard
>>> drive with an Accord server, and you wanted to make the comparison fair.
>>> Is this correct?
>>
>> No, it isn't.
>> Both ZooKeeper and Accord use a dedicated hard drive for logging.
>> The settings file I used is here:
>> https://gist.github.com/1240291
>>
>> Please tell me if I have made a mistake.
>>
>
> I gave a cursory look, and I can't see any obvious problem. It is intriguing
> that the numbers are so low. Have you tried with different numbers of
> servers? I'm not sure if I just missed this information, but what version of
> ZooKeeper are you looking at?

I use ZooKeeper 3.3.3, the latest release version, with a 7200 rpm HDD.
The benchmarks were measured with 3, 5, and 7 servers.
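For readers who don't open the gist: the key point is that snapshots and the
transaction log live on separate devices. A minimal zoo.cfg along those lines
might look like the following; the paths and values here are illustrative
assumptions, not the exact settings in the gist:

  # Illustrative only -- see the gist above for the actual settings used.
  tickTime=2000
  initLimit=10
  syncLimit=5
  clientPort=2181
  # snapshots on one disk ...
  dataDir=/data/zookeeper
  # ... transaction log on a dedicated disk
  dataLogDir=/log-disk/zookeeper
  # 3-server ensemble (the 5- and 7-server setups add server.4 .. server.7)
  server.1=zk1:2888:3888
  server.2=zk2:2888:3888
  server.3=zk3:2888:3888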

> Also, if it is not too much trouble, could you please report on your read performance?

I'll measure it and report back later.

>>> 2- The previous observation leads me to the next question: could you say
>>> more about your use of disk with persistence on?
>>
>> ZooKeeper returns an ACK after the write has been logged to disk on more
>> than half of the machines. Accord returns an ACK after the write has been
>> logged to disk on just the one machine that accepted the request. At the
>> same time, however, the ACK guarantees that all servers receive the
>> messages in the same order.
>> This difference in semantics means that the measurement is not fair.
>> I would like to measure under a fair setup, but have not done so yet. If
>> users request it, I'm going to implement and measure it. Note that the
>> in-memory benchmark is fair.
>>
>
> I'm not sure I understand this part. You say that an operation is ACKed
> after being written to one disk, but also that it is guaranteed to be
> delivered in the same order in all servers. Does it mean that Accord still
> replicates on other servers before ACKing but the other servers do not write
> to disk? Otherwise, the first server may crash and never come back, and the
> message cannot possibly be delivered by other servers.

Yes, you're right.
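To make the difference concrete, here is a minimal sketch of the two
acknowledgement rules as described in this thread; it is illustrative
pseudologic only, not actual ZooKeeper or Accord source:

  // Illustrative sketch only, not actual ZooKeeper or Accord code.
  class AckRules {
      // ZooKeeper: the leader acks a write once a majority of the ensemble
      // has logged it to disk.
      static boolean zookeeperCanAck(int serversThatLoggedToDisk, int ensembleSize) {
          return serversThatLoggedToDisk > ensembleSize / 2;
      }

      // Accord (as described above): the accepting server acks once its own
      // disk write is done and the messaging layer has fixed the message's
      // place in the total order seen by every server; the others log later.
      static boolean accordCanAck(boolean localDiskWriteDone, boolean totalOrderDeliveryDone) {
          return localDiskWriteDone && totalOrderDeliveryDone;
      }
  }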

> One question related to this point: with Accord, do you replicate the
> original request message or the result of operation?

Accord replicates the request message.

> Do you guarantee that each server executes a request or applies
> the result of a request exactly once? If not, what kind of semantics
> does Accord provide?

Yes, it does.
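A minimal sketch of what that combination implies on each server: apply the
replicated request (not its result) in total order, and skip anything already
applied. Class and member names here are assumptions, not Accord internals:

  // Illustrative sketch only; names are assumptions, not Accord internals.
  import java.util.HashMap;
  import java.util.Map;

  class Replica {
      private final Map<String, byte[]> store = new HashMap<>();
      private long lastAppliedSeq = -1;

      // Called on every server, in the total order fixed by the messaging layer.
      void onDeliver(long seq, String key, byte[] value) {
          if (seq <= lastAppliedSeq) {
              return;                // duplicate delivery: apply exactly once
          }
          store.put(key, value);     // re-execute the original request locally
          lastAppliedSeq = seq;
      }
  }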

> The comment in the paper is exactly right, we instruct our users to store
> metadata in ZooKeeper and data elsewhere. There are systems designed to
> store bulk data, and ZooKeeper shouldn't try to compete with such storage
> systems, it is not our goal.

You're right.
We decided to support unlimited data sizes because some write-intensive
applications, such as a replicated queue or a replicated database, need to
send big messages, even though a big message can block the messages behind it.

In that sense, I think Accord may not compete with ZooKeeper, because their
application ranges are different. ZooKeeper is suitable for read-mostly
applications such as sharing configuration, leader election, and failure
detection, while Accord is more suitable for write-mostly applications.
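As one concrete example of the write-mostly pattern: a replicated queue can be
built by writing each (possibly large) message under a sequenced key, letting
the service's total order define the queue order. The KvClient interface below
is purely an assumption for illustration, not the real Accord client API:

  // Hypothetical client interface and queue sketch, for illustration only.
  interface KvClient {
      void write(String key, byte[] value);   // replicated, totally ordered write
  }

  class ReplicatedQueue {
      private final KvClient client;
      private long nextSeq = 0;               // single-producer sketch

      ReplicatedQueue(KvClient client) { this.client = client; }

      // Each enqueue is one replicated write; payloads may be large, but a big
      // message delays the messages queued behind it, as noted above.
      void enqueue(byte[] message) {
          client.write(String.format("/queue/msg-%020d", nextSeq++), message);
      }
  }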

>>> 4- If I understand the group communication substrate Accord uses, it
>>> enables Accord to process client requests in any server. ZooKeeper has a
>>> leader for a few reasons, one being the ability of managing client
>>> sessions. Ephemeral nodes, for example, are bound to sessions. Are there
>>> similar abstractions in Accord? If the answer is positive, could you
>>> explain it a bit? If not, is it doable with the substrate you're using?
>>
>> Yes, Accord has abstractions like ephemeral nodes.
>> We use the Corosync cluster engine, which provides virtual synchrony
>> semantics. It ensures consensus on the message ordering and on the ordering
>> of server failures among all servers (conductor daemons).
>
> Good to know that you also support ephemerals. Could you say a little more
> about how you decide to eliminate an ephemeral node? I suppose that an
> ephemeral is bound to the client that created it somehow, and it is deleted
> if the client crashes or disconnects. What's the exact mechanism?

Accord currently provides more primitive abstractions.
Accord servers assign a client ID when a client joins the Accord cluster,
and notify the other clients of the departed client's ID when a client
leaves. When a client writes a node that should be ephemeral, it marks the
node with its own ID.
When the other clients receive the notification that a client has left,
they try to delete the nodes that the departed client created.
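A minimal sketch of that client-side cleanup protocol. The AccordClient
interface, the callback names, and the ownership convention are all
assumptions for illustration, not the real Accord API:

  // Hypothetical sketch of the cleanup protocol described above.
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  interface AccordClient {                     // hypothetical client interface
      void delete(String path);
  }

  class EphemeralCleanup {
      // node path -> owner's client id, recorded when the ephemeral node is written
      private final Map<String, Long> ephemeralOwners = new ConcurrentHashMap<>();

      // The writing client marks the node with its own id.
      void onEphemeralWrite(String path, long ownerClientId) {
          ephemeralOwners.put(path, ownerClientId);
      }

      // Every surviving client receives the departed client's id and tries to
      // delete the nodes that client created (best effort; any survivor may win).
      void onClientLeft(long departedClientId, AccordClient client) {
          for (Map.Entry<String, Long> entry : ephemeralOwners.entrySet()) {
              if (entry.getValue() == departedClientId) {
                  client.delete(entry.getKey());
                  ephemeralOwners.remove(entry.getKey());
              }
          }
      }
  }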

>
>>> 3- The limitation on the message size in ZooKeeper is not a fundamental
>>> limitation. We have chosen to limit for the reasons we explain in the
>>> wiki page that is linked in the Accord page. Do you have any particular
>>> use case in mind for which you think it would be useful to have very
>>> large messages?
>>
>> Some developers use ZooKeeper as storage. For example, the developers of
>> Onix, a distributed control platform for OpenFlow networks, say that:
>> "for most the object size limitations of
>> Zookeeper and convenience of accessing the configuration
>> state directly through the NIB are a reason to favor the
>> transactional database."
>> http://www.usenix.org/event/osdi10/tech/full_papers/Koponen.pdf
>>
>
>>
>>> 5- I'm not sure where we say that 8 bytes is a typical value in the
>>> documentation. I actually remember writing in one of our papers that a
>>> typical value is around 1k bytes.
>>
>> The benchmark assumes a lock use case. I'm going to measure various
>> message sizes and will report the results.
>
> Sounds good, thanks.
>
> -Flavio
>
>>
>>
>>> -Flavio
>>>
>>> On Sep 23, 2011, at 4:22 PM, OZAWA Tsuyoshi wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm sending this to the zookeeper-user and hbase-user lists since there may
>>>> be some cluster developers there interested in participating in this project.
>>>>
>>>> I am pleased to announce the initial release of Accord, yet another
>>>> coordination service like Apache ZooKeeper.
>>>> ZooKeeper is, as you know, the de facto standard coordination kernel at
>>>> present.
>>>> Accord provides ZK-like features as a coordination service. Concretely
>>>> speaking, it features:
>>>> - Accord is a distributed, transactional, and fully-replicated (no SPoF)
>>>> key-value store with strong consistency.
>>>> - Accord can scale out to tens of nodes.
>>>> - Accord servers can handle tens of thousands of clients.
>>>> - Changes made by one client's write request can be notified to the
>>>> other clients.
>>>> - Accord detects client join/leave events and notifies the other clients
>>>> of the joined/left client's information.
>>>>
>>>> There are, however, some problems with ZK:
>>>> - ZK cannot handle write-intensive workloads well. ZK forwards all write
>>>> requests to a master server, which may become a bottleneck under
>>>> write-intensive workloads.
>>>> - ZK is optimized for disk-persistence mode, not for in-memory mode.
>>>> ZOOKEEPER-866 shows that ZK has another bottleneck outside disk
>>>> persistence, even though there is a need for fully-replicated storage with
>>>> both strong consistency and low latency.
>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-866
>>>> - Limited transaction APIs. ZK can only issue write operations (write,
>>>> del) in a transaction (multi-update).
>>>>
>>>> These restrictions limit the capability of the coordination kernel.
>>>> Accord solves these problems:
>>>> 1. Accord uses the Corosync Cluster Engine as a total-order messaging
>>>> infrastructure instead of Zab, the atomic broadcast protocol ZK uses. The
>>>> engine enables any server to accept and process requests.
>>>> 2. Accord supports an in-memory mode.
>>>> 3. More flexible transaction support: not only write and del operations,
>>>> but also cmp, copy, and read operations are supported within a
>>>> transaction.
>>>>
>>>> These differences in the core engine (1, 2) enable us to avoid the master
>>>> bottleneck. Benchmarks demonstrate that the write-operation throughput
>>>> of Accord is much higher than that of ZooKeeper
>>>> (up to 20 times better throughput in persistent mode, and up to 18 times
>>>> better throughput in in-memory mode).
>>>>
>>>> The high-performance kernel can extend the range of applications. Assumed
>>>> applications are, for instance:
>>>> - A distributed lock manager whose lock operations arrive at high
>>>> frequency from thousands of clients.
>>>> I have the lock manager for HBase in mind in particular. The coordination
>>>> service enables HBase to update multiple rows with ACID properties.
>>>> HBase acts as a distributed DB with ACID properties until the coordination
>>>> service becomes the bottleneck. The new coordination kernel, Accord, can
>>>> handle 18 times better throughput than ZK. As a result, Accord can
>>>> dramatically improve the scalability of HBase with ACID properties.
>>>> - A metadata management service for large-scale distributed storage,
>>>> including HDFS, Ceph, Sheepdog, etc.
>>>> A replicated master can be implemented easily.
>>>> - A replicated message queue or logger (for instance, a replicated
>>>> RabbitMQ).
>>>>
>>>> and so on.
>>>>
>>>> Other distributed systems can use Accord's features easily because Accord
>>>> provides general-purpose APIs (read/write/del and more flexible
>>>> transactions).
>>>>
>>>> More information, including getting started, benchmarks, and API docs, is
>>>> available from our project page:
>>>> http://www.osrg.net/accord
>>>>
>>>> and all code is available from:
>>>> http://github.com/collie/accord
>>>>
>>>> Please try it out, and let me know if you have any opinions or problems.
>>>>
>>>> Best regards,
>>>> OZAWA Tsuyoshi
>>>> <ozawa.tsuyoshi@lab.ntt.co.jp>
>>>>
>>>
>>> flavio
>>> junqueira
>>>
>>> research scientist
>>>
>>> fpj@yahoo-inc.com
>>> direct +34 93-183-8828
>>>
>>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>>> phone (408) 349 3300 fax (408) 349 3301
>>>
>>>
>
> flavio
> junqueira
>
> research scientist
>
> fpj@yahoo-inc.com
> direct +34 93-183-8828
>
> avinguda diagonal 177, 8th floor, barcelona, 08018, es
> phone (408) 349 3300    fax (408) 349 3301
>
>



-- 
OZAWA Tsuyoshi <ozawa.tsuyoshi@gmail.com>
