Subject: Re: handling deletes
From: Avinash Lakshman
To: cassandra-dev@incubator.apache.org
Date: Wed, 1 Apr 2009 12:37:58 -0700

In reality, from two years of production experience with Dynamo and now with Cassandra, it is not as extreme as it seems :). The options are either strong consistency, which is hard to get right in a distributed setting and which costs you availability even when you do get it right, or eventual consistency. All the tools like read repair help in achieving the latter. So I guess it boils down to what you want from your app: C or A. (Rough sketches of hinted handoff and read repair appear at the end of this message.)

Avinash

On Wed, Apr 1, 2009 at 11:46 AM, Jun Rao wrote:
>
> My reply is inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA 95120-6099
>
> junrao@almaden.ibm.com
>
> Jonathan Ellis wrote on 04/01/2009 10:50:37 AM:
>
> > On Wed, Apr 1, 2009 at 11:41 AM, Jun Rao wrote:
> > > I am wondering if this is part of the bigger issue of data consistency.
> > > Following your example: a row x is replicated to nodes A, B, and C. C goes
> > > down. A and B delete x. When C comes back, C should contact other nodes that
> > > hold hinted handoff data intended for C. So, in theory, the missing deletion
> > > of x will be propagated to C at some point and not lost. However, the
> > > problem is that those hinted handoff nodes can die before the handoff
> > > completes. Then C needs some other way to sync itself up. Nodes A and B are
> > > the only possible sources. Unfortunately, data in A and B is accumulated
> > > independently of C, and therefore syncing them up is a bit challenging.
> >
> > Right. Or you could have a network partition when C comes back up,
> > preventing the handoff. There are lots of things that can go wrong.
> > Hence the "eventual" part of "eventually consistent." :)
> >
> > > In the short run, I am not sure if I really like the solution you suggested
> > > here. However, I don't have a better solution either.
> >
> > Like I said, it's not perfect, but it's better than the alternatives
> > I've seen. I'd much rather have an imperfect solution than none at all.
> >
> > > In the long run, maybe we should look into peer-to-peer replication
> > > techniques instead of relying on hinted handoff. In P2P replication, an
> > > update can be directed to any replica, which will try to push it to its
> > > peers. The push will be almost real time if the peers are up. If a peer is
> > > down, changes for it will be accumulated and re-pushed when it is up again.
> > > Because an update is always initiated from one replica, it is easier to sync
> > > up the replicas through log shipping.
> >
> > There's a huge amount of complexity you're glossing over, though: what
> > if the replica responsible for the initiation goes down? Then you
> > have to elect a new one. This is (a) very complicated and (b) causes
> > loss of availability. I prefer the existing system. (If you want
> > consistency over availability, then HBase or Hypertable is a better
> > choice, since that is what they are designed for.)
>
> P2P replication definitely adds complexity, and it is just one of the
> alternatives. However, there is also complexity in hinted handoff + read
> repair + Merkle trees (when they are added). I am not sure which one is more
> complicated. In P2P replication, since you can initiate a write on any
> replica, you just need to pick a live replica for writes. As for
> availability, a lot has to do with how quickly a failed node is detected.
> Today, if you write to a node that has actually failed but has not yet been
> detected by Cassandra, the write will also fail.
>
> Overall, I think eventual consistency is fine. However, eventual
> consistency probably shouldn't be equated with updates taking forever to
> show up. Some sort of guarantee on how outdated a piece of data is will
> likely be useful to many applications.
>
> > -Jonathan
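Below is a minimal sketch of the hinted-handoff path the thread is describing, under the assumption of a coordinator that queues any mutation it could not deliver to a down replica and replays the queue once the failure detector reports that replica alive again. The names here (HintedHandoffSketch, Mutation, ReplicaClient) are invented for illustration and are not Cassandra's actual classes; the point is that a delete travels as an explicit tombstone, so a returning node like C in Jun's example learns that x was removed while it was down.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class HintedHandoffSketch {

    // A mutation is either an insert/update or a delete; deletes are explicit tombstones.
    public static class Mutation {
        final String key;
        final String value;          // null for a tombstone
        final long timestampMicros;  // timestamp used to reconcile versions later
        final boolean isTombstone;

        Mutation(String key, String value, long timestampMicros, boolean isTombstone) {
            this.key = key;
            this.value = value;
            this.timestampMicros = timestampMicros;
            this.isTombstone = isTombstone;
        }
    }

    // Stands in for whatever messaging layer actually ships mutations to a peer.
    public interface ReplicaClient {
        void apply(String replica, Mutation m);
    }

    // One queue of undelivered mutations per replica that was down at write time.
    private final Map<String, Queue<Mutation>> hints = new HashMap<String, Queue<Mutation>>();

    // Coordinator path: the write (or delete) aimed at 'downReplica' could not be
    // delivered, so remember it locally instead of dropping it.
    public void storeHint(String downReplica, Mutation m) {
        Queue<Mutation> q = hints.get(downReplica);
        if (q == null) {
            q = new ArrayDeque<Mutation>();
            hints.put(downReplica, q);
        }
        q.add(m);
    }

    // When the failure detector reports 'downReplica' alive again, replay everything
    // queued for it -- including tombstones for rows deleted while it was down, which
    // is how the missing delete of x reaches C in the example above.
    public void replayHints(String downReplica, ReplicaClient client) {
        Queue<Mutation> pending = hints.remove(downReplica);
        if (pending == null) {
            return;
        }
        for (Mutation m : pending) {
            client.apply(downReplica, m);
        }
    }
}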
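Read repair, the other tool mentioned for converging replicas, can be sketched in the same spirit: the coordinator reads a key from every replica, takes the version with the newest timestamp (possibly a tombstone), and writes that winner back to any replica that returned something older or nothing at all. Again, the types and method names below are made up for the example rather than taken from the codebase.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReadRepairSketch {

    // A replica's answer for a key: a value or a tombstone, plus the timestamp
    // that decides which version wins.
    public static class Versioned {
        final String value;          // null when this is a tombstone
        final long timestampMicros;
        final boolean isTombstone;

        Versioned(String value, long timestampMicros, boolean isTombstone) {
            this.value = value;
            this.timestampMicros = timestampMicros;
            this.isTombstone = isTombstone;
        }
    }

    public interface ReplicaClient {
        Versioned read(String replica, String key);           // null if the replica has nothing
        void write(String replica, String key, Versioned v);  // push a version to a replica
    }

    // Read from every replica, pick the newest version, and push it back to any
    // replica that returned an older version or nothing at all. The winner may be
    // a tombstone, which is how a missed delete eventually reaches a stale replica.
    public String readWithRepair(String key, List<String> replicas, ReplicaClient client) {
        Map<String, Versioned> responses = new HashMap<String, Versioned>();
        Versioned newest = null;
        for (String r : replicas) {
            Versioned v = client.read(r, key);
            responses.put(r, v);
            if (v != null && (newest == null || v.timestampMicros > newest.timestampMicros)) {
                newest = v;
            }
        }
        if (newest == null) {
            return null; // no replica has ever seen this key
        }
        for (String r : replicas) {
            Versioned v = responses.get(r);
            if (v == null || v.timestampMicros < newest.timestampMicros) {
                client.write(r, key, newest); // the repair write
            }
        }
        // Callers never see a tombstone; to them the row is simply gone.
        return newest.isTombstone ? null : newest.value;
    }
}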