cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sylvain Lebresne (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-1072) Increment counters
Date Mon, 08 Nov 2010 21:44:22 GMT


Sylvain Lebresne commented on CASSANDRA-1072:

I think this patch has a number of important shortcomings (all of which have been discussed
in some comments of CASSANDRA-1546, so sorry for the repetition):

# the patch uses IP addresses as node identifiers for the partitions of the counters. This
is overly fragile (a change of IP, accidental or not, could corrupt data) and, I'm growing
more and more convinced, a bad idea. An obvious solution is to use uuids instead of IPs. However
in that perspective, I believe the approach taken by CASSANDRA-1546 to be a lot simpler (but
I could be biased) that the clean context logic of this patch. Because the clean context logic
requires a global knowledge of the node uuid affectations, while the approach of CASSANDRA-1546
does not.

# cluster topology changes could result in data corruption, if no proper care is taken by
the user. Consider a simple cluster with a single node A (RF=1), accepting updates on a counter
c. We boostrap node B, that gets counter c in its range (it is thus streamed to B). And now
let's say that node B is decommissioned. Counter c will be streamed back to A "as is". If,
after B was boostrapped, repair has been run on A, this is fine. But if repair wasn't run,
it'll result in a (on disk) corrupted counter, because the newly streamed parts will be merged
with the old version.  And I don't think that requiring users to run repair at the risk of
losing data is the right fix.  This is not unrelated to my previous point in that I believe
that with uuids, we can fix that by renewing a given node ID on range changes. Again, the
approach of CASSANDRA-1546 where we don't need to know the affectation of node ID -> actual
node (at least not on the read and/or write path) make that much easier.  # there is a race
condition during reads. During reads, a given row can be read twice, because the switch from
current memtable to memtable pending flush is not atomic. The same is true when a flushed
memtable becomes a sstable and at the end of compaction. This is fine for normal reads, but
will result in bogus reads for counters. The patch attached to CASSANDRA-1546 proposes a fix
to that.

# there is no replication on writes. Which is worst than merely not supporting CL.QUORUM.
This patch does provide any reasonable durability guarantee. And imho, this is far too important
to be simply left as a 'later improvement'.

> Increment counters
> ------------------
>                 Key: CASSANDRA-1072
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core
>            Reporter: Johan Oskarsson
>            Assignee: Kelvin Kakugawa
>         Attachments: CASSANDRA-1072.patch, Incrementcountersdesigndoc.pdf
> Break out the increment counters out of CASSANDRA-580. Classes are shared between the
two features but without the plain version vector code the changeset becomes smaller and more

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message