Date: Mon, 25 Feb 2013 22:22:15 +0000 (UTC)
From: "Cristian Opris (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Comment Edited] (CASSANDRA-5062) Support CAS

    [ https://issues.apache.org/jira/browse/CASSANDRA-5062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13586371#comment-13586371 ]

Cristian Opris edited comment on CASSANDRA-5062 at 2/25/13 10:21 PM:
---------------------------------------------------------------------

AFAICT from the Spinnaker paper, they only require ZK for fault-tolerant leader election, failure detection, and possibly cluster membership (the lower right box in the diagram in section 4.1). The rest of it is their actual data storage engine.

A few more comments:

1. Paxos can be made very efficient, particularly in stable operation. I believe Zab effectively devolves into atomic broadcast (not even 2PC) with a stable leader, so you can normally do writes with a single roundtrip, just like now.

2. There is a difference between what I described above and what Spinnaker does: I believe they elect a leader for the entire replica group, while my description assumes one full Paxos instance per row write. I'm not fully clear at the moment how this would work, but I believe even that can be optimized to a single roundtrip per write in normal operation (I believe one of Google's papers describes piggybacking the commit on the next proposal, for example).

Off the top of my head: the coordinator assumes one of the replicas is most up to date and attempts to use it as leader. That replica starts a Paxos round, attaching the write payload. If the proposal is accepted on a majority, the replica can send the commit, opportunistically attaching it to further proposals. If the Paxos round fails (or a number of rounds fail), the replica is likely behind on many rows, so the coordinator switches to another replica. A sketch of this flow follows below.

Now this is all preliminary, as I haven't fully thought it through, but I think it's definitely worth investigating. While it may be a complicated protocol, it has significant performance advantages over locks. Just count how many roundtrips you'd need in the "wait chain" algorithm, not to mention handling expired/orphan locks.
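To make that flow concrete, here is a minimal single-process sketch in Java. Everything in it is hypothetical: Replica, Proposal, leadRound, and the piggybacked-commit field are illustrative names, not Cassandra or Spinnaker APIs, and a real implementation would also have to adopt any value already accepted under a higher ballot reported in prepare responses (omitted here for brevity).

{code:java}
// Minimal sketch of the per-row Paxos flow described above. All names are
// illustrative; this is not a Cassandra or Spinnaker API.
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class RowPaxosSketch {

    /** One proposal: a ballot, the row write it carries, and the ballot of
     *  the previous round whose commit piggybacks on this message. */
    record Proposal(long ballot, String rowMutation, long piggybackedCommit) {}

    /** The three messages a replica answers for one row's Paxos instance. */
    interface Replica {
        boolean prepare(long ballot);      // phase 1: promise
        boolean accept(Proposal proposal); // phase 2: accept value
        void commit(long ballot);          // learn: apply durably
    }

    private final AtomicLong ballots = new AtomicLong();
    private long pendingCommit = -1; // commit to piggyback on the next proposal

    /** Leader side: one round for one row write. Returns true on majority
     *  accept; the commit itself rides on the next proposal, so the caller
     *  sees a single roundtrip in steady state. */
    boolean leadRound(List<Replica> group, String mutation) {
        long ballot = ballots.incrementAndGet();
        int quorum = group.size() / 2 + 1;

        // With a stable leader the prepare phase can be skipped after the
        // first round (the "single roundtrip" point in comment 1); it is
        // shown here for completeness.
        int promised = 0;
        for (Replica r : group) if (r.prepare(ballot)) promised++;
        if (promised < quorum) return false;

        Proposal p = new Proposal(ballot, mutation, pendingCommit);
        int accepted = 0;
        for (Replica r : group) if (r.accept(p)) accepted++;
        if (accepted < quorum) return false;

        pendingCommit = ballot; // defer the commit to the next proposal
        return true;
    }

    /** Coordinator side: try the replica presumed most up to date; after a
     *  few failed rounds assume it is behind on many rows and switch. */
    static boolean write(List<RowPaxosSketch> candidateLeaders,
                         List<Replica> group, String mutation) {
        final int maxRoundsPerLeader = 3; // arbitrary threshold for the sketch
        for (RowPaxosSketch leader : candidateLeaders) {
            for (int round = 0; round < maxRoundsPerLeader; round++) {
                if (leader.leadRound(group, mutation)) return true;
            }
            // repeated failures: this replica is probably stale, try the next
        }
        return false;
    }
}
{code}

The point of the pendingCommit field is that in steady state the accept message for write N doubles as the commit for write N-1, which is what keeps the protocol at one roundtrip per write.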
> Support CAS
> -----------
>
>                 Key: CASSANDRA-5062
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5062
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: API, Core
>            Reporter: Jonathan Ellis
>             Fix For: 2.0
>
>
> "Strong" consistency is not enough to prevent race conditions. The classic example is user account creation: we want to ensure usernames are unique, so we only want to signal account creation success if nobody else has created the account yet. But naive read-then-write allows clients to race and both think they have a green light to create.
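The race in the quoted description is easy to reproduce even in a single process. A rough Java analogy of what a CAS primitive buys, using plain java.util.concurrent rather than anything Cassandra-specific:

{code:java}
// In-process analogy only: the naive version races exactly like
// read-then-write account creation; the CAS-style version lets exactly
// one creator win.
import java.util.concurrent.ConcurrentHashMap;

public class UsernameRace {
    static final ConcurrentHashMap<String, String> accounts = new ConcurrentHashMap<>();

    // Naive read-then-write: two clients can both see "absent" and both
    // believe they created the account.
    static boolean createNaive(String user, String profile) {
        if (!accounts.containsKey(user)) { // read
            accounts.put(user, profile);   // write: races with other creators
            return true;
        }
        return false;
    }

    // CAS-style: the check and the write are one atomic step, so exactly
    // one creator observes success.
    static boolean createCas(String user, String profile) {
        return accounts.putIfAbsent(user, profile) == null;
    }
}
{code}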