Date: Sun, 25 Sep 2011 16:02:37 +0900
From: OZAWA Tsuyoshi
To: user@hbase.apache.org
CC: Flavio Junqueira, user@zookeeper.apache.org, MORITA Kazutaka
Subject: Re: [announce] Accord: A high-performance coordination service for write-intensive workloads

(2011/09/25 6:43), Flavio Junqueira wrote:
> Thanks for sending this reference to the list, it sounds very
> interesting. I have a few questions and comments, if you don't mind:
>
> 1- I was wondering if you can give more detail on the setup you used to
> generate the numbers you show in the graphs on your Accord page. The
> ZooKeeper values are way too low, and I suspect that you're using a
> single hard drive. It could be because you expect to use a single hard
> drive with an Accord server, and you wanted to make the comparison fair.
> Is this correct?

No, it isn't. Both ZooKeeper and Accord use a dedicated hard drive for
logging. The configuration file I used is here:
https://gist.github.com/1240291
Please tell me if I have made a mistake.
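In outline, a ZooKeeper configuration that puts the transaction log on
its own disk looks like the sketch below (the paths are illustrative,
not the actual benchmark values; those are in the gist above):

    # zoo.cfg (illustrative)
    tickTime=2000
    clientPort=2181
    # snapshots go to one disk
    dataDir=/disk1/zookeeper/data
    # the transaction log goes to its own dedicated disk
    dataLogDir=/disk2/zookeeper/txnlog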
> 2- The previous observation leads me to the next question: could you say
> more about your use of disk with persistence on?

ZooKeeper returns an ACK only after more than half of the servers have
written the update to disk. Accord returns an ACK as soon as the single
server that accepted the request has written it to disk; at the same
time, the ACK still guarantees that all servers receive the messages in
the same order. Because the two ACKs have different durability
semantics, this measurement is not entirely fair. I would like to
measure under equivalent conditions, but have not done so yet; if users
request it, I will implement and measure it. Note that the in-memory
benchmark is fair.

> 3- The limitation on the message size in ZooKeeper is not a fundamental
> limitation. We have chosen to limit for the reasons we explain in the
> wiki page that is linked in the Accord page. Do you have any particular
> use case in mind for which you think it would be useful to have very
> large messages?

Some developers use ZooKeeper as storage. For example, the developers
of Onix, a distributed control platform for OpenFlow networks, write
that:

"for most the object size limitations of Zookeeper and convenience of
accessing the configuration state directly through the NIB are a reason
to favor the transactional database."
http://www.usenix.org/event/osdi10/tech/full_papers/Koponen.pdf

> 4- If I understand the group communication substrate Accord uses, it
> enables Accord to process client requests in any server. ZooKeeper has a
> leader for a few reasons, one being the ability of managing client
> sessions. Ephemeral nodes, for example, are bound to sessions. Are there
> similar abstractions in Accord? If the answer is positive, could you
> explain it a bit? If not, is it doable with the substrate you're using?

Yes, Accord has abstractions like ephemeral nodes. We use the Corosync
Cluster Engine, which provides virtual synchrony semantics: it
guarantees that all servers (conductor daemons) agree on the order of
messages and on the order of server failures.
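For readers unfamiliar with the ZooKeeper side, this is roughly what the
session-bound ephemeral nodes Flavio mentions look like in ZooKeeper's
Java API (a minimal sketch: the connect string and path are
illustrative, and error handling is omitted):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class EphemeralSketch {
        public static void main(String[] args) throws Exception {
            // Connect with a 5-second session timeout (no watcher, for brevity).
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, null);

            // An ephemeral znode lives only as long as this client's session:
            // when the session expires or the client disconnects, the server
            // deletes the node automatically. (Assumes /members exists.)
            zk.create("/members/worker-1", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            zk.close();
        }
    }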
> 5- I'm not sure where we say that 8 bytes is a typical value in the
> documentation. I actually remember writing in one of our papers that a
> typical value is around 1k bytes.

The benchmark assumes a lock-service use case, hence the small
messages. I am going to measure with various message sizes and will
report the results.

Please ask me if you have more questions or opinions.

- OZAWA Tsuyoshi

> -Flavio
>
> On Sep 23, 2011, at 4:22 PM, OZAWA Tsuyoshi wrote:
>
>> Hi,
>>
>> I am sending this to the zookeeper-user and hbase-user lists, since
>> there may be some cluster developers there interested in
>> participating in this project.
>>
>> I am pleased to announce the initial release of Accord, yet another
>> coordination service like Apache ZooKeeper, which is at present the
>> de facto standard coordination kernel.
>> Accord provides ZK-like features as a coordination service.
>> Concretely speaking:
>> - Accord is a distributed, transactional, fully-replicated (no SPoF)
>> key-value store with strong consistency.
>> - Accord can scale out to tens of nodes.
>> - Accord servers can handle tens of thousands of clients.
>> - The changes made by one client's write request can be notified to
>> the other clients.
>> - Accord detects clients joining and leaving, and notifies the other
>> clients of the joined/left client's information.
>>
>> There are, however, some problems in ZK:
>> - ZK cannot handle write-intensive workloads well. ZK forwards all
>> write requests to a master server, which can become a bottleneck in
>> write-intensive workloads.
>> - ZK is optimized for disk-persistence mode, not for in-memory mode.
>> ZOOKEEPER-866 shows that ZK has another bottleneck besides disk
>> persistence, even though there is a need for fully-replicated storage
>> with both strong consistency and low latency.
>> https://issues.apache.org/jira/browse/ZOOKEEPER-866
>> - Limited transaction APIs. ZK can only issue write operations
>> (write, del) in a transaction (multi-update; for comparison, see the
>> sketch at the end of this mail).
>>
>> These restrictions limit the capability of the coordination kernel.
>> Accord solves these problems:
>> 1. Accord uses the Corosync Cluster Engine as its total-order
>> messaging infrastructure instead of Zab, the atomic broadcast
>> protocol ZK uses. The engine enables any server to accept and process
>> requests.
>> 2. Accord supports an in-memory mode.
>> 3. More flexible transaction support: not only write and del
>> operations, but also cmp, copy, and read operations are supported
>> inside a transaction.
>>
>> These differences in the core engine (1, 2) enable us to avoid the
>> master bottleneck. Benchmarks demonstrate that the write-operation
>> throughput of Accord is much higher than that of ZooKeeper (up to 20
>> times higher in persistent mode, and up to 18 times higher in
>> in-memory mode).
>>
>> The high-performance kernel widens the range of possible
>> applications. Assumed applications include, for instance:
>> - A distributed lock manager whose lock operations arrive at high
>> frequency from thousands of clients. I have the lock manager for
>> HBase in mind in particular: a coordination service enables HBase to
>> update multiple rows with ACID properties, and HBase can act as a
>> distributed DB with ACID properties until the coordination service
>> becomes the bottleneck. Since the new coordination kernel, Accord,
>> can handle 18 times the throughput of ZK, Accord can dramatically
>> improve the scalability of HBase with ACID properties.
>> - A metadata management service for large-scale distributed storage
>> such as HDFS, Ceph, and Sheepdog; a replicated master can be
>> implemented easily.
>> - A replicated message queue or logger (for instance, a replicated
>> RabbitMQ), and so on.
>>
>> Other distributed systems can adopt Accord easily, because Accord
>> provides general-purpose APIs (read/write/del and the more flexible
>> transactions).
>>
>> More information, including getting-started docs, benchmarks, and API
>> docs, is available from our project page:
>> http://www.osrg.net/accord
>>
>> and all code is available from:
>> http://github.com/collie/accord
>>
>> Please try it out, and let me know about any opinions or problems.
>>
>> Best regards,
>> OZAWA Tsuyoshi
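P.S. For comparison with the transaction discussion above: ZooKeeper's
multi-update is exposed through the Java client's multi() call (added
for ZooKeeper 3.4). A minimal sketch, with illustrative paths and no
error handling:

    import java.util.Arrays;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class MultiSketch {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, null);

            // The three operations below commit atomically: either all of
            // them are applied, or none are. All of them are write-type
            // operations (create/setData/delete), which is the limitation
            // discussed above. (Assumes /app already exists; the version
            // argument -1 means "any version".)
            zk.multi(Arrays.asList(
                Op.create("/app/task-1", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT),
                Op.setData("/app/state", "running".getBytes(), -1),
                Op.delete("/app/task-0", -1)));

            zk.close();
        }
    }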