Message-ID: <302016062.1240244987853.JavaMail.jira@brutus>
Date: Mon, 20 Apr 2009 09:29:47 -0700 (PDT)
From: "Jun Rao (JIRA)"
To: cassandra-commits@incubator.apache.org
Reply-To: cassandra-dev@incubator.apache.org
Subject: [jira] Updated: (CASSANDRA-33) Bugs in tombstone handling in remove code
In-Reply-To: <1647103922.1238593754134.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/CASSANDRA-33?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Jun Rao updated CASSANDRA-33:
-----------------------------

    Attachment: 0007_fix_another_sequencefile_bug.patch

Include patch for another bug in SequenceFile.java, where the size of the row is not calculated correctly. At the moment this bug is not exposed, since the row key and the row size written to outBuf are simply read and discarded in SSTable.next(). We should probably open another issue to clean up the code so that the row key and row size are not written to outBuf.

> Bugs in tombstone handling in remove code
> -----------------------------------------
>
>                 Key: CASSANDRA-33
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-33
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jonathan Ellis
>            Assignee: Jonathan Ellis
>             Fix For: 0.3
>
>         Attachments: 0001-preserve-tombstones-until-a-GC-grace-period-has-elap.patch, 0002-omit-tombstones-from-column_t-and-supercolumn_t-retu.patch, 0003-make-GC_GRACE_IN_SECONDS-customizable-in-storage.con.patch, 0004-and-5-v2.patch, 0004_expose_remove_bug.patch, 0005_fix_exposed_remove_bug.patch, 0006_fix_sequencefile_bug.patch, 0007_fix_another_sequencefile_bug.patch
>
>
> [copied from dev list]
> Avinash pointed out two bugs in my remove code. One is easy to fix,
> the other is tougher.
> The easy one is that my code removes tombstones (deletion markers) at
> the ColumnFamilyStore level, so when CassandraServer does read repair
> it will not know about the tombstones and they will not be replicated
> correctly. This can be fixed by simply moving the removeDeleted call
> up to just before CassandraServer's final return-to-client.
> The hard one is that tombstones are problematic on GC (that is, major
> compaction of SSTables, to use the Bigtable paper terminology).
> One failure scenario: Node A, B, and C replicate some data. C goes
> down. The data is deleted. A and B delete it and later GC it. C
> comes back up. C now has the only copy of the data so on read repair
> the stale data will be sent to A and B.
> A solution: pick a number N such that we are confident that no node
> will be down (and catch up on hinted handoffs) for longer than N days.
> (Default value: 10?) Then, no node may GC tombstones before N days
> have elapsed. Also, after N days, tombstones will no longer be read
> repaired. (This prevents a node which has not yet GC'd from sending a
> new tombstone copy to a node that has already GC'd.)
> Implementation detail: we'll need to add a 32-bit "time of tombstone"
> to ColumnFamily and SuperColumn. (For Column we can stick it in the
> byte[] value, since we already have an unambiguous way to know if the
> Column is in a deleted state.) We only need 32 bits since the time
> frame here is sufficiently granular that we don't need ms. Also, we
> will use the system clock for these values, not the client timestamp,
> since we don't know what the source of the client timestamps is.
> Admittedly this is suboptimal compared to being able to GC immediately
> but it has the virtue of being (a) easily implemented, (b) with no
> extra components such as a coordination protocol, and (c) better than
> not GCing tombstones at all (the other easy way to ensure
> correctness).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
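
For readers who want the grace-period rule from the quoted description in concrete form, here is a minimal Python sketch. It is an illustration only, not the code in the attached patches: the function names (now_seconds, can_gc, should_read_repair, pack_deletion_time) are hypothetical, the 10-day default is the tentative value floated in the description, the constant name is borrowed from the title of patch 0003, and both the exact boundary (GC allowed once exactly N days have elapsed) and the big-endian encoding are assumptions.

```python
import struct
import time

# Tentative default from the issue text ("Default value: 10?"), in seconds.
GC_GRACE_IN_SECONDS = 10 * 24 * 60 * 60


def now_seconds():
    """System-clock time in whole seconds (not the client timestamp)."""
    return int(time.time())


def can_gc(tombstone_time, now, gc_grace=GC_GRACE_IN_SECONDS):
    """True once the grace period has fully elapsed since the tombstone was written."""
    return now - tombstone_time >= gc_grace


def should_read_repair(tombstone_time, now, gc_grace=GC_GRACE_IN_SECONDS):
    """Tombstones older than the grace period are no longer read repaired,
    so a node that has not yet GC'd cannot resurrect them elsewhere."""
    return not can_gc(tombstone_time, now, gc_grace)


def pack_deletion_time(tombstone_time):
    """Encode the time of tombstone as a signed 32-bit big-endian integer.
    Second resolution is enough here, so 4 bytes suffice."""
    return struct.pack(">i", tombstone_time)
```

With a tombstone written at time t, can_gc(t, t + 9 days) is false and should_read_repair is true; at t + 10 days the two answers flip. Keeping the two checks as exact complements means a tombstone is never both eligible for GC and still being pushed to other replicas.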