From: Apache Wiki
To: Apache Wiki
Reply-To: cassandra-dev@incubator.apache.org
Date: Mon, 08 Feb 2010 20:13:53 -0000
Message-ID: <20100208201353.15108.29360@eos.apache.org>
Subject: [Cassandra Wiki] Trivial Update of "DistributedDeletes" by JonathanEllis

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.

The "DistributedDeletes" page has been changed by JonathanEllis.
The comment on this change is: unlink ZK.
http://wiki.apache.org/cassandra/DistributedDeletes?action=diff&rev1=1&rev2=2

--------------------------------------------------

  Thus, a delete operation can't just wipe out all traces of the data being removed immediately: if we did, and a replica did not receive the delete operation, when it becomes available again it will treat the replicas that did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special value called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request.

- There's one more piece to the problem: how do we know when it's safe to remove tombstones? In a fully distributed system, we can't. We could add a coordinator like ZooKeeper, but that would pollute the simplicity of the design, as well as complicating ops -- then you'd essentially have two systems to monitor, instead of one. (This is not to say ZK is bad software -- I believe it is best in class at what it does -- only that it solves a problem that we do not wish to add to our system.)
+ There's one more piece to the problem: how do we know when it's safe to remove tombstones? In a fully distributed system, we can't. We could add a coordinator like !ZooKeeper, but that would pollute the simplicity of the design, as well as complicating ops -- then you'd essentially have two systems to monitor, instead of one. (This is not to say ZK is bad software -- I believe it is best in class at what it does -- only that it solves a problem that we do not wish to add to our system.)

  So, Cassandra does what distributed systems designers frequently do when confronted with a problem we don't know how to solve: define some additional constraints that turn it into one that we do. Here, we defined a constant, GCGraceSeconds, and had each node track tombstone age locally. Once it has aged past the constant, it can be GC'd.
This means that if you have a node down for longer than GCGraceSeconds, you should treat it as a failed node and replace it as described in [[Operations]]. The default setting is very conservative, at 10 days; you can reduce that once you have Anti Entropy configured to your satisfaction. And of course if you are only running a single Cassandra node, you can reduce it to zero, and tombstones will be GC'd at the first compaction (see MemtableSStables).
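
The tombstone and GCGraceSeconds behaviour described in the page text can be illustrated with a minimal Python sketch. This is not Cassandra's implementation: the names `Cell`, `reconcile`, and `can_gc` are hypothetical, and the sketch assumes a single integer timestamp (in seconds) serves both to order versions and to measure tombstone age.

```python
from dataclasses import dataclass
from typing import Optional

# 10 days, the default mentioned in the page text.
GC_GRACE_SECONDS = 10 * 24 * 60 * 60


@dataclass(frozen=True)
class Cell:
    """One stored version of a value (hypothetical model, not Cassandra's)."""
    value: Optional[str]  # None marks a tombstone
    timestamp: int        # seconds; assumed to order versions and date tombstones

    @property
    def is_tombstone(self) -> bool:
        return self.value is None


def reconcile(a: Optional[Cell], b: Optional[Cell]) -> Optional[Cell]:
    """Return the newer of two replica versions; a missing cell always loses."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a.timestamp >= b.timestamp else b


def can_gc(cell: Cell, now: int, gc_grace: int = GC_GRACE_SECONDS) -> bool:
    """A tombstone may be dropped only after it has aged past the grace period."""
    return cell.is_tombstone and now - cell.timestamp > gc_grace


# Replica A recorded the delete at t=100; replica B was down and still
# holds the write from t=50.
replica_a = Cell(value=None, timestamp=100)
replica_b = Cell(value="hello", timestamp=50)

# With a tombstone, the delete wins and can be propagated to B.
winner = reconcile(replica_a, replica_b)

# Had A wiped the data outright, B's stale value would look like a
# missed write and be "repaired" back onto A.
resurrected = reconcile(None, replica_b)
```

In this sketch, passing `gc_grace=0` makes any tombstone older than `now` collectable at once, which mirrors the single-node case mentioned above.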