From: Jonathan Ellis <jbellis@gmail.com>
Date: Sat, 2 Jul 2011 12:11:43 -0500
Subject: Re: Strong Consistency with ONE read/writes
To: user@cassandra.apache.org

The way HBase uses ZK (for master election) is not even close to how
Cassandra uses the failure detector. Using ZK for each operation would
(a) not scale and (b) not work cross-DC for any reasonable latency
requirements.
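(For context: the failure detector in question is Cassandra's gossip-fed
phi accrual detector. Below is a minimal sketch of the core computation,
assuming heartbeat inter-arrival times are roughly exponentially
distributed; the class and method names are illustrative, not
Cassandra's actual API.)

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of a phi accrual failure detector. Suspicion ("phi") grows
// continuously the longer a heartbeat is overdue, instead of flipping
// a binary up/down switch -- that is what makes it probabilistic.
public class PhiAccrualDetector {
    private static final int WINDOW = 1000; // intervals to remember
    // 8.0 is Cassandra's default phi_convict_threshold
    private static final double CONVICT_THRESHOLD = 8.0;

    private final Deque<Long> intervals = new ArrayDeque<>();
    private long lastHeartbeat = -1;

    // Record a heartbeat arrival at time nowMillis.
    public synchronized void heartbeat(long nowMillis) {
        if (lastHeartbeat >= 0) {
            intervals.addLast(nowMillis - lastHeartbeat);
            if (intervals.size() > WINDOW)
                intervals.removeFirst();
        }
        lastHeartbeat = nowMillis;
    }

    // Under the exponential model, P(next heartbeat later than t) =
    // exp(-t / mean), so phi = -log10 of that = t / (mean * ln 10).
    public synchronized double phi(long nowMillis) {
        if (intervals.isEmpty())
            return 0.0;
        double mean = intervals.stream().mapToLong(Long::longValue)
                               .average().orElse(1.0);
        return (nowMillis - lastHeartbeat) / (mean * Math.log(10));
    }

    public synchronized boolean convicted(long nowMillis) {
        return phi(nowMillis) >= CONVICT_THRESHOLD;
    }
}

The contrast with ZooKeeper: phi is a purely local estimate, so there is
no coordination round-trip on the read/write path, but two nodes can
disagree about whether a third is up. That disagreement is exactly why
the "preferred replica" approach discussed below cannot guarantee that
only one node serves a key.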
On Sat, Jul 2, 2011 at 11:55 AM, Yang wrote:
> there is a JIRA completed in 0.7.x that "prefers" a certain node in the
> snitch, so this does roughly what you want MOST of the time.
>
> But the problem is that it does not GUARANTEE that the same node will
> always be read. I recently read the HBase vs Cassandra comparison
> thread that started after Facebook dropped Cassandra for their
> messaging system, and understood some of the differences. What you want
> is essentially what HBase does. The fundamental difference there is
> really due to the gossip protocol: it's a probabilistic, or eventually
> consistent, failure detector, while HBase/Google Bigtable use
> ZooKeeper/Chubby to provide a strong failure detector (a distributed
> lock). So in HBase, if a tablet server goes down, it really goes down;
> it cannot re-grab the tablet from the new tablet server without going
> through a start-up protocol (notifying the master, which would notify
> the clients, etc.). In other words, it is guaranteed that one tablet is
> served by only one tablet server at any given time. In comparison, the
> above JIRA only TRIES to serve that key from one particular replica.
> HBase can have that guarantee because the group membership is
> maintained by the strong failure detector.
>
> Just for hacking curiosity, a strong failure detector + Cassandra
> replicas is not impossible (actually seems not difficult), although the
> performance is not clear. What would such a strong failure detector
> bring to Cassandra besides this ONE-ONE strong consistency? That is an
> interesting question, I think.
>
> Considering that HBase has been deployed on big clusters, it is
> probably OK with the performance of the strong ZooKeeper failure
> detector. Then a further question: why did Dynamo originally choose the
> probabilistic failure detector? Yes, Dynamo's main theme is "eventually
> consistent", so the Phi-detector is **enough**, but if a strong
> detector buys us more with little cost, wouldn't that be great?
>
> On Fri, Jul 1, 2011 at 6:53 PM, AJ wrote:
>>
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same node
>> from a client. It seems the only thing needed is to allow the clients
>> to compute which node is the closest replica for the given key, using
>> the same algorithm C* uses. When the first replica receives the write
>> request, it will write to itself, which should complete before any of
>> the other replicas, and then return. The loads should still stay
>> balanced if using the random partitioner. If the first replica becomes
>> unavailable (however that is defined), then the clients can send to
>> the next replica in the ring and switch from ONE reads/writes to
>> QUORUM reads/writes temporarily, until the first replica becomes
>> available again. QUORUM is required since there could be some replicas
>> that were not updated after the first replica went down.
>>
>> Will this work? The goal is to have strong consistency with a
>> read/write consistency level as low as possible, with a network
>> performance boost as a secondary goal.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
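(To make AJ's proposal concrete, here is a minimal sketch of the
client-side routing it requires, assuming RandomPartitioner and a client
that already knows the token ring. The class names and the ONE-to-QUORUM
fallback rule are illustrative, not an existing client API.)

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: route each key to its first replica on the ring, and drop
// from ONE to QUORUM whenever we have to route around that replica.
public class ReplicaRouter {
    // token -> node address, learned out of band (e.g. describe_ring)
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    public void addNode(BigInteger token, String address) {
        ring.put(token, address);
    }

    // RandomPartitioner-style token: MD5 of the key as a positive
    // integer.
    static BigInteger token(String key) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(md5).abs();
    }

    // First replica: owner of the first token at or after the key's
    // token, wrapping around the ring.
    public String firstReplica(String key) throws Exception {
        SortedMap<BigInteger, String> tail = ring.tailMap(token(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    // AJ's failover rule: ONE while the preferred replica looks up,
    // QUORUM while routing around it (its peers may have missed its
    // last writes).
    public String consistencyFor(String key, Set<String> downNodes)
            throws Exception {
        return downNodes.contains(firstReplica(key)) ? "QUORUM" : "ONE";
    }
}

The hole, per the rest of the thread, is in downNodes: with a gossip
detector, two clients can disagree about whether the preferred replica
is down, and during that window one of them may read stale data at ONE
from a different replica. Closing that window is what the strong
(lock-based) failure detector buys HBase.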