cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4071) Topology changes can lead to bad counters (at RF=1)
Date Tue, 04 Dec 2012 14:57:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509781#comment-13509781 ]

Sylvain Lebresne commented on CASSANDRA-4071:
---------------------------------------------

bq. Could we introduce something like gc_g_s for node ids so renewing doesn't have to be avoided so religiously?

The problem is that when we renew a counterId, the existing shards are still part of all
those counters. So the only way to get rid of those shards would be to merge their values
into some newer shard. But as far as I can tell (and I've put quite some thought into that),
this is really hard to do correctly, because it's not something you can do in isolation from
other nodes.
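
For context, here is a minimal sketch of what a shard looks like (hypothetical types, not Cassandra's actual CounterId/CounterContext classes): the counter's value is the sum of all its shards, which is why shards written under an old, renewed counterId keep contributing and can't simply be dropped.

{code}
// Minimal, hypothetical sketch -- not the real Cassandra classes.
import java.util.List;

final class Shard {
    final String counterId; // hypothetical string id; the real CounterId is UUID-based
    final long clock;       // per-counterId logical clock
    final long value;       // this shard's contribution to the counter

    Shard(String counterId, long clock, long value) {
        this.counterId = counterId;
        this.clock = clock;
        this.value = value;
    }
}

final class CounterValue {
    // The counter's total sums every shard, old counterIds included.
    static long total(List<Shard> shards) {
        long sum = 0;
        for (Shard s : shards)
            sum += s.value;
        return sum;
    }
}
{code}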

That being said, we do have a mechanism that does exactly that kind of shard merging in a few
cases, i.e. if a node detects that a counter has 2 shards for 2 of its own old counterIds, it
merges them in two phases (roughly sketched after the list below): first we merge one of the
shards into the other, but we keep the first shard with a value of 0. Only when we consider this
has been propagated to all replicas (after gc_grace, basically) do we remove the zeroed shard.
However:
# this shard merging is already a fairly dark and hacky corner of the counter implementation.
I'd rather remove it than complicate it further.
# it only works for counters that don't change ownership, basically. As soon as a counter changes
ownership, whatever old counterIds it has won't ever be merged. The reason is that a node only
knows about its own counterIds, because that's the only simple way I've found to ensure 2 nodes
don't start doing the 2-phase merge on the same counterIds at the same time.
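
To make that more concrete, here is a rough sketch of the two-phase merge, reusing the simplified Shard type from the earlier snippet; the names are made up and the real logic lives in CounterContext, so treat this as an illustration of the idea only.

{code}
// Rough sketch of the two-phase merge described above (hypothetical helper, not
// the actual CounterContext code). Phase 1 folds one old local shard into another
// but keeps a zero-valued shard so replicas still converge; phase 2 drops the
// zeroed shard once we assume it has propagated everywhere (roughly after gc_grace).
import java.util.ArrayList;
import java.util.List;

final class TwoPhaseMerge {
    // Phase 1: merge 'from' into 'to', keeping 'from' around with value 0.
    static List<Shard> phase1(Shard from, Shard to, List<Shard> untouched) {
        List<Shard> result = new ArrayList<>(untouched);
        result.add(new Shard(to.counterId, to.clock + 1, to.value + from.value));
        result.add(new Shard(from.counterId, from.clock + 1, 0)); // kept until gc_grace
        return result;
    }

    // Phase 2: once the zeroed shard is considered propagated to all replicas,
    // it can finally be dropped.
    static List<Shard> phase2(List<Shard> shards, String zeroedCounterId) {
        List<Shard> kept = new ArrayList<>();
        for (Shard s : shards)
            if (!s.counterId.equals(zeroedCounterId))
                kept.add(s);
        return kept;
    }
}
{code}

Keeping the zeroed shard around for gc_grace plays much the same role as a tombstone: a replica that hasn't seen the merge yet must not resurrect the old value.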

                
> Topology changes can lead to bad counters (at RF=1)
> ---------------------------------------------------
>
>                 Key: CASSANDRA-4071
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4071
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Sylvain Lebresne
>              Labels: counters
>
> A counter is broken into shards (partitions), each shard being 'owned' by a given replica
> (meaning that only this replica will increment that shard). For a given node A, the resolution
> of 2 shards (having the same owner) follows these rules:
> * if the shards are owned by A, then sum the values (in the original patch, 'owned by A'
> was based on the machine IP address; in the current code, it's based on the shard having
> a delta flag, but the principle is the same)
> * otherwise, keep the maximum value (based on the shards' clocks)
> During topology changes (bootstrap/move/decommission), we transfer data from A to B, but
> the shards owned by A are not owned by B (and we cannot make them owned by B because during
> those operations (bootstrap, ...) a given shard would be owned by both A and B, which would
> break counters). But this means that B won't interpret the streamed shards correctly.
> Concretely, if A receives a number of counter increments that end up in different sstables
> (the shards should thus be summed) and those increments are then streamed to B as part of
> bootstrap, B will not sum the increments but use the clocks to keep the maximum value.
> I've pushed a test that shows the breakage at https://github.com/riptano/cassandra-dtest/commits/counters_test
> (the test needs CASSANDRA-4070 to work correctly).
> Note that in practice, replication will hide this (even though B will have the bad value
> after the bootstrap, a read or read repair from the other replicas will repair it). This is a
> problem for RF=1, however.
> Another problem is that during repair, a node won't correctly repair other nodes on its
> own shards (unless everything is fully compacted).
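
To make the failure mode above concrete, here is a toy sketch of the quoted resolution rule (hypothetical delta flag and simplified signature, not the actual CounterContext code): on the owning node A the two shards are summed, but once streamed to B the delta flag no longer applies, so B resolves by clock and one of the increments is silently lost.

{code}
// Toy illustration only -- hypothetical 'delta' flag, not real Cassandra code.
final class ShardResolution {
    static long resolve(long clock1, long value1, boolean delta1,
                        long clock2, long value2, boolean delta2) {
        if (delta1 && delta2)
            return value1 + value2;                 // locally owned shards: sum
        return clock1 >= clock2 ? value1 : value2;  // otherwise: highest clock wins
    }

    public static void main(String[] args) {
        System.out.println(resolve(1, 5, true, 2, 3, true));   // on A (owner): 8
        System.out.println(resolve(1, 5, false, 2, 3, false)); // on B after streaming: 3
    }
}
{code}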

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
