Message-ID: <1729571457.1239995414903.JavaMail.jira@brutus>
Date: Fri, 17 Apr 2009 12:10:14 -0700 (PDT)
From: "Jonathan Ellis (JIRA)"
To: cassandra-commits@incubator.apache.org
Reply-To: cassandra-dev@incubator.apache.org
Subject: [jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code
In-Reply-To: <1647103922.1238593754134.JavaMail.jira@brutus>
Mailing-List: contact cassandra-commits-help@incubator.apache.org; run by ezmlm

[
https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-33:
------------------------------------

    Attachment: 0004-and-5-v2.patch

fixed tests to make it more obvious what should be happening.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch
>
>
> [copied from dev list]
> Avinash pointed out two bugs in my remove code. One is easy to fix,
> the other is tougher.
> The easy one is that my code removes tombstones (deletion markers) at
> the ColumnFamilyStore level, so when CassandraServer does read repair
> it will not know about the tombstones and they will not be replicated
> correctly. This can be fixed by simply moving the removeDeleted call
> up to just before CassandraServer's final return-to-client.
> The hard one is that tombstones are problematic on GC (that is, major
> compaction of SSTables, to use the Bigtable paper terminology).
> One failure scenario: Node A, B, and C replicate some data. C goes
> down. The data is deleted. A and B delete it and later GC it. C
> comes back up. C now has the only copy of the data so on read repair
> the stale data will be sent to A and B.
> A solution: pick a number N such that we are confident that no node
> will be down (and catch up on hinted handoffs) for longer than N days.
> (Default value: 10?) Then, no node may GC tombstones before N days
> have elapsed. Also, after N days, tombstones will no longer be read
> repaired. (This prevents a node which has not yet GC'd from sending a
> new tombstone copy to a node that has already GC'd.)
> Implementation detail: we'll need to add a 32-bit "time of tombstone"
> to ColumnFamily and SuperColumn. (For Column we can stick it in the
> byte[] value, since we already have an unambiguous way to know if the
> Column is in a deleted state.) We only need 32 bits since the time
> frame here is sufficiently granular that we don't need ms. Also, we
> will use the system clock for these values, not the client timestamp,
> since we don't know what the source of the client timestamps is.
> Admittedly this is suboptimal compared to being able to GC immediately
> but it has the virtue of being (a) easily implemented, (b) with no
> extra components such as a coordination protocol, and (c) better than
> not GCing tombstones at all (the other easy way to ensure
> correctness).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
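
For readers following the grace-period proposal in the issue description, here is a minimal sketch of the rule in Python. It is not taken from any of the attached patches. The function names (local_deletion_time, encode_deletion_time, decode_deletion_time, may_gc, should_read_repair) are hypothetical, the 10-day default only mirrors the tentative "Default value: 10?" in the description, and treating the exact N-day boundary as "elapsed" is an assumption. The 4-byte packing illustrates the idea of keeping a 32-bit, second-resolution deletion time in a Column's byte[] value.

```python
import struct
import time

# Assumed default of N = 10 days, expressed in seconds. The description only
# floats 10 as a possible default; the real value is meant to be configurable.
GC_GRACE_IN_SECONDS = 10 * 24 * 60 * 60


def local_deletion_time(now=None):
    """Return the deletion time in whole seconds from the system clock.

    The client-supplied timestamp is deliberately not used here, because
    its source (and unit) is unknown to the server.
    """
    return int(time.time() if now is None else now)


def encode_deletion_time(seconds):
    """Pack the deletion time as a big-endian signed 32-bit integer (4 bytes)."""
    return struct.pack(">i", seconds)


def decode_deletion_time(raw):
    """Inverse of encode_deletion_time."""
    return struct.unpack(">i", raw)[0]


def may_gc(deleted_at, now, grace=GC_GRACE_IN_SECONDS):
    """A tombstone may be dropped at compaction only once the grace period has elapsed."""
    return now - deleted_at >= grace


def should_read_repair(deleted_at, now, grace=GC_GRACE_IN_SECONDS):
    """A tombstone is read repaired only while it is still inside the grace period.

    Once the period has elapsed it is no longer propagated, so a replica that
    has not compacted yet cannot resurrect it on one that already has.
    """
    return not may_gc(deleted_at, now, grace)
```

With these definitions, a tombstone written at time t is kept and read repaired up to t + GC_GRACE_IN_SECONDS, and from then on it is both eligible for GC and excluded from read repair, so the two conditions never overlap.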