Message-ID: <4043352.83821289252674084.JavaMail.jira@thor>
Date: Mon, 8 Nov 2010 16:44:34 -0500 (EST)
From: "Sylvain Lebresne (JIRA)"
To: commits@cassandra.apache.org
Subject: [jira] Issue Comment Edited: (CASSANDRA-1072) Increment counters
In-Reply-To: <27538947.141273514791054.JavaMail.jira@thor>

    [ https://issues.apache.org/jira/browse/CASSANDRA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929750#action_12929750 ]

Sylvain Lebresne edited comment on CASSANDRA-1072 at 11/8/10 4:43 PM:
----------------------------------------------------------------------

I think this patch has a number of important shortcomings (all of which have been discussed in some comments of CASSANDRA-1546, so sorry for the repetition):

# the patch uses IP addresses as node identifiers for the partitions of the counters. This is overly fragile (a change of IP, accidental or not, could corrupt data) and, I'm growing more and more convinced, a bad idea. An obvious solution is to use uuids instead of IPs. In that perspective, however, I believe the approach taken by CASSANDRA-1546 is a lot simpler (but I could be biased) than the clean context logic of this patch, because the clean context logic requires global knowledge of the node uuid assignments, while the approach of CASSANDRA-1546 does not. (A sketch of the partitioned-counter idea follows this list.)
# cluster topology changes could result in data corruption if no proper care is taken by the user. Consider a simple cluster with a single node A (RF=1), accepting updates on a counter c. We bootstrap node B, which gets counter c in its range (it is thus streamed to B). Now let's say that node B is decommissioned. Counter c will be streamed back to A "as is". If repair has been run on A after B was bootstrapped, this is fine. But if repair wasn't run, it will result in an (on-disk) corrupted counter, because the newly streamed parts will be merged with the old version.
And I don't think that requiring users to run repair, at the risk of losing data otherwise, is the right fix. This is not unrelated to my previous point, in that I believe that with uuids we can fix this by renewing a given node's ID on range changes. Again, the approach of CASSANDRA-1546, where we don't need to know the assignment of node ID -> actual node (at least not on the read and/or write path), makes that much easier. (A toy walk-through of this corruption is given after the list.)
# there is a race condition during reads: a given row can be read twice, because the switch from the current memtable to the memtable pending flush is not atomic. The same is true when a flushed memtable becomes an sstable, and at the end of compaction. This is fine for normal reads, but results in bogus reads for counters. The patch attached to CASSANDRA-1546 proposes a fix for this. (The last sketch below shows why the duplicate read is harmless for normal columns but not for counters.)
# there is no replication on writes, which is worse than merely not supporting CL.QUORUM: this patch does not provide any reasonable durability guarantee. And imho, this is far too important to be simply left as a 'later improvement'.
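
To make the first point concrete, here is a minimal sketch of a partitioned counter keyed by node uuid. It is purely illustrative (class and method names are mine, not the patch's), but it shows where the node identifier sits in the data model and why an unstable identifier such as an IP is dangerous:

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Illustrative partitioned counter: the value is the sum of per-node
// shards, and each node only ever increments its own shard.
public class CounterContext {
    // Shard owner -> count. The patch keys shards by IP address; keying
    // them by a stable uuid survives a node changing its IP.
    private final Map<UUID, Long> shards = new HashMap<>();

    // A local write bumps only this node's shard.
    public void increment(UUID localNodeId, long delta) {
        shards.merge(localNodeId, delta, Long::sum);
    }

    // The counter's value is the sum over all shards.
    public long total() {
        return shards.values().stream().mapToLong(Long::longValue).sum();
    }

    // Reconciling two copies: per shard, keep the highest count seen.
    // Since a shard is only written by its owner, higher means newer.
    // If the key were an IP later reused by another node, this step
    // would silently mix two different nodes' histories.
    public void mergeFrom(CounterContext other) {
        other.shards.forEach((node, count) -> shards.merge(node, count, Math::max));
    }
}
{code}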
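
And here is a toy replay of the decommission scenario from the second point, reusing the CounterContext sketch above. The numbers, and the choice between "max per shard" and "sum per shard" merge rules, are assumptions for illustration, not the patch's actual reconciliation logic:

{code:java}
import java.util.UUID;

// Toy replay of the bootstrap/decommission scenario. All numbers made up.
public class DecommissionReplay {
    public static void main(String[] args) {
        UUID a = UUID.randomUUID(); // node A's identifier

        CounterContext onA = new CounterContext();
        onA.increment(a, 10);       // c on A when B bootstraps: shard(A) = 10

        CounterContext onB = new CounterContext();
        onB.increment(a, 10);       // the snapshot streamed to B

        onA.increment(a, 5);        // A keeps taking writes: shard(A) = 15

        // B is decommissioned and streams c back "as is". With the
        // max-per-shard rule above, the stale snapshot is harmless:
        onA.mergeFrom(onB);
        System.out.println(onA.total()); // 15 -- still correct

        // But if the streamed copy were treated as fresh local deltas and
        // shards were summed on merge, A would end up with 10 + 15 = 25:
        // the on-disk corruption described in the second point (avoided
        // only if repair ran on A after B bootstrapped).
    }
}
{code}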
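
Finally, a small sketch of why the duplicate read in the third point is benign for regular columns but bogus for counters (again illustrative code, not Cassandra's actual read path):

{code:java}
import java.util.List;

// Why reading the same row twice is harmless for regular columns but
// not for counters. Real reads reconcile full cells, not bare longs.
public class CounterReadRace {
    // Regular columns reconcile by "keep the newest version", which is
    // idempotent: seeing the same value twice changes nothing.
    static long reconcileRegular(List<Long> versionsSeen) {
        return versionsSeen.stream().mapToLong(Long::longValue).max().orElse(0L);
    }

    // A counter read sums partial shards, which is not idempotent:
    // seeing the same partial twice double-counts it.
    static long reconcileCounter(List<Long> partialsSeen) {
        return partialsSeen.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        // During a non-atomic memtable switch, the same row (value 42)
        // can be observed both in the memtable pending flush and in the
        // freshly written sstable:
        List<Long> seenTwice = List.of(42L, 42L);

        System.out.println(reconcileRegular(seenTwice)); // 42 -- correct
        System.out.println(reconcileCounter(seenTwice)); // 84 -- bogus
    }
}
{code}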

> Increment counters
> ------------------
>
>                 Key: CASSANDRA-1072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1072
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
>
>
> Break the increment counters out of CASSANDRA-580. Classes are shared between the two features, but without the plain version vector code the changeset becomes smaller and more manageable.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.