Message-ID: <4043352.83821289252674084.JavaMail.jira@thor>
Date: Mon, 8 Nov 2010 16:44:34 -0500 (EST)
From: "Sylvain Lebresne (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] Issue Comment Edited: (CASSANDRA-1072) Increment counters
In-Reply-To: <27538947.141273514791054.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929750#action_12929750 ]

Sylvain Lebresne edited comment on CASSANDRA-1072 at 11/8/10 4:43 PM:
----------------------------------------------------------------------

I think this patch has a number of important shortcomings (all of which have been discussed in some comments of CASSANDRA-1546, so sorry for the repetition):

# the patch uses IP addresses as node identifiers for the partitions of the counters. This is overly fragile (a change of IP, accidental or not, could corrupt data) and, I'm growing more and more convinced, a bad idea. An obvious solution is to use uuids instead of IPs. In that perspective, however, I believe the approach taken by CASSANDRA-1546 is a lot simpler (but I could be biased) than the clean context logic of this patch, because the clean context logic requires global knowledge of the node uuid assignments, while the approach of CASSANDRA-1546 does not. (A sketch of the partitioned-counter idea follows this list.)
# cluster topology changes could result in data corruption if no proper care is taken by the user. Consider a simple cluster with a single node A (RF=1), accepting updates on a counter c. We bootstrap node B, which gets counter c in its range (it is thus streamed to B). Now let's say that node B is decommissioned. Counter c will be streamed back to A "as is". If repair has been run on A after B was bootstrapped, this is fine. But if repair wasn't run, it will result in an (on-disk) corrupted counter, because the newly streamed parts will be merged with the old version.
And I don't think that requiring users to run repair, at the risk of losing data otherwise, is the right fix. This is not unrelated to my previous point, in that I believe that with uuids we can fix this by renewing a given node's ID on range changes. Again, the approach of CASSANDRA-1546, where we don't need to know the assignment of node ID -> actual node (at least not on the read and/or write path), makes that much easier. (A toy walk-through of this corruption is given after the list.)
# there is a race condition during reads: a given row can be read twice, because the switch from the current memtable to the memtable pending flush is not atomic. The same is true when a flushed memtable becomes an sstable, and at the end of compaction. This is fine for normal reads, but results in bogus reads for counters. The patch attached to CASSANDRA-1546 proposes a fix for this. (The last sketch below shows why the duplicate read is harmless for normal columns but not for counters.)
# there is no replication on writes, which is worse than merely not supporting CL.QUORUM: this patch does not provide any reasonable durability guarantee. And imho, this is far too important to be simply left as a 'later improvement'.
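
To make the first point concrete, here is a minimal sketch of a partitioned counter keyed by node uuid. It is purely illustrative (class and method names are mine, not the patch's), but it shows where the node identifier sits in the data model and why an unstable identifier such as an IP is dangerous:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative partitioned counter: the value is the sum of per-node
// shards, and each node only ever increments its own shard.
public class CounterContext {
    // Shard owner -> count. The patch keys shards by IP address; keying
    // them by a stable uuid survives a node changing its IP.
    private final Map<UUID, Long> shards = new HashMap<>();

    // A local write bumps only this node's shard.
    public void increment(UUID localNodeId, long delta) {
        shards.merge(localNodeId, delta, Long::sum);
    }

    // The counter's value is the sum over all shards.
    public long total() {
        return shards.values().stream().mapToLong(Long::longValue).sum();
    }

    // Reconciling two copies: per shard, keep the highest count seen.
    // Since a shard is only written by its owner, higher means newer.
    // If the key were an IP later reused by another node, this step
    // would silently mix two different nodes' histories.
    public void mergeFrom(CounterContext other) {
        other.shards.forEach((node, count) -> shards.merge(node, count, Math::max));
    }
}
{code}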
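
And here is a toy replay of the decommission scenario from the second point, reusing the CounterContext sketch above. The numbers, and the choice between "max per shard" and "sum per shard" merge rules, are assumptions for illustration, not the patch's actual reconciliation logic:

{code:java}
import java.util.UUID;

// Toy replay of the bootstrap/decommission scenario. All numbers made up.
public class DecommissionReplay {
    public static void main(String[] args) {
        UUID a = UUID.randomUUID(); // node A's identifier

        CounterContext onA = new CounterContext();
        onA.increment(a, 10);       // c on A when B bootstraps: shard(A) = 10

        CounterContext onB = new CounterContext();
        onB.increment(a, 10);       // the snapshot streamed to B

        onA.increment(a, 5);        // A keeps taking writes: shard(A) = 15

        // B is decommissioned and streams c back "as is". With the
        // max-per-shard rule above, the stale snapshot is harmless:
        onA.mergeFrom(onB);
        System.out.println(onA.total()); // 15 -- still correct

        // But if the streamed copy were treated as fresh local deltas and
        // shards were summed on merge, A would end up with 10 + 15 = 25:
        // the on-disk corruption described in the second point (avoided
        // only if repair ran on A after B bootstrapped).
    }
}
{code}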
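
Finally, a small sketch of why the duplicate read in the third point is benign for regular columns but bogus for counters (again illustrative code, not Cassandra's actual read path):

{code:java}
import java.util.List;

// Why reading the same row twice is harmless for regular columns but
// not for counters. Real reads reconcile full cells, not bare longs.
public class CounterReadRace {
    // Regular columns reconcile by "keep the newest version", which is
    // idempotent: seeing the same value twice changes nothing.
    static long reconcileRegular(List<Long> versionsSeen) {
        return versionsSeen.stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    // A counter read sums partial shards, which is not idempotent:
    // seeing the same partial twice double-counts it.
    static long reconcileCounter(List<Long> partialsSeen) {
        return partialsSeen.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // During a non-atomic memtable switch, the same row (value 42)
        // can be observed both in the memtable pending flush and in the
        // freshly written sstable:
        List<Long> seenTwice = List.of(42L, 42L);

        System.out.println(reconcileRegular(seenTwice)); // 42 -- correct
        System.out.println(reconcileCounter(seenTwice)); // 84 -- bogus
    }
}
{code}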

> Increment counters
> ------------------
>
>                 Key: CASSANDRA-1072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
>
>
> Break the increment counters out of CASSANDRA-580. Classes are shared between the two features, but without the plain version vector code the changeset becomes smaller and more manageable.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.