cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4071) Topology changes can lead to bad counters (at RF=1)
Date Wed, 24 Oct 2012 16:50:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483381#comment-13483381 ]

Sylvain Lebresne commented on CASSANDRA-4071:
---------------------------------------------

I'm beginning to think we've made a mistake in the counters design. Namely, when the "leader"
receives a new increment, we write the increment, then read (which merges the new increment
into the previous ones), then send that merged result to the other replicas. That is why we
need all the delta business, which is the root cause of this issue and of CASSANDRA-4417 (and
we're not even sure we understand all the cases that can produce the error message of CASSANDRA-4417).

An alternative would be that, when the leader receives a new increment, it reads, applies
the increment to the value read, and writes the result. If we do that, we don't need deltas
anymore, fixing this issue as well as CASSANDRA-4417. We also never have to renew nodeIds,
so we also fix the problem of ever-growing counter contexts. And overall we greatly simplify
the code. There would be clear performance downsides however:
# we will have to lock during the read, apply increment, write result dance (sketched after this list).
# we still read before the first write, so replicate_on_write won't be an option anymore (I've
always been clear about how I personally feel about that option in the first place, so that would
almost be an advantage in my opinion, but some disagree). It will also increase the latency
of writes at CL.ONE.
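
A minimal sketch of what that read-apply-write dance could look like (assumed names, with an in-memory map standing in for the local store; not an actual patch):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ReadApplyWriteLeader {
    private final Map<String, Long> counters = new ConcurrentHashMap<>();  // stand-in for the local store
    private final Map<String, Object> locks = new ConcurrentHashMap<>();

    public long incrementOnLeader(String counterKey, long increment) {
        Object lock = locks.computeIfAbsent(counterKey, k -> new Object());
        synchronized (lock) {                                      // 1) lock (downside #1 above)
            long current = counters.getOrDefault(counterKey, 0L);  // 2) read before the first write
            long updated = current + increment;                    // 3) apply the increment
            counters.put(counterKey, updated);                     // 4) write the final value
            return updated;  // this plain value is what would be replicated: no deltas needed
        }
    }
}
{code}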

But even if we decide to go that route, another thing to take into account is that I don't know
how to support the upgrade to this new way of doing things without requiring a major compaction
on upgrade (which is particularly a problem for LeveledCompaction, because we don't even know
how to major compact there).

So definitely not perfect, but it's the best idea I've had so far.

                
> Topology changes can lead to bad counters (at RF=1)
> ---------------------------------------------------
>
>                 Key: CASSANDRA-4071
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4071
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.0.0
>            Reporter: Sylvain Lebresne
>              Labels: counters
>
> A counter is broken into shards (partitions), each shard being 'owned' by a given replica
> (meaning that only this replica will increment that shard). For a given node A, the resolution
> of 2 shards (having the same owner) follows these rules:
> * if the shards are owned by A, then sum the values (in the original patch, 'owned by A'
> was based on the machine IP address; in the current code, it's based on the shard having
> a delta flag, but the principle is the same)
> * otherwise, keep the maximum value (based on the shards' clocks)
> During topology changes (bootstrap/move/decommission), we transfer data from A to B, but
> the shards owned by A are not owned by B (and we cannot make them owned by B because during
> those operations (bootstrap, ...) a given shard would be owned by both A and B, which would
> break counters). But this means that B won't interpret the streamed shards correctly.
> Concretely, if A receives a number of counter increments that end up in different sstables
> (the shards should thus be summed) and then those increments are streamed to B as part of
> bootstrap, B will not sum the increments but will use the clocks to keep the maximum value.
> I've pushed a test that shows the breakage at https://github.com/riptano/cassandra-dtest/commits/counters_test
> (the test needs CASSANDRA-4070 to work correctly).
> Note that in practice, replication will hide this (even though B will have the bad value
> after the bootstrap, a read or read repair from the other replicas will repair it). This is a
> problem for RF=1 however.
> Another problem is that during repair, a node won't correctly repair other nodes on its
> own shards (unless everything is fully compacted).
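
To make the resolution rules and the streaming scenario quoted above concrete, here is a small standalone illustration (invented names and a Java record; this is not the actual CounterContext code): A sums the shards it owns, while B, which does not own them, keeps only the max-clock shard and so silently loses an increment, which nothing can repair at RF=1.

{code:java}
public class ShardResolutionExample {
    // A shard of a counter context: its owner, a clock, and a value.
    record Shard(String owner, long clock, long value) {}

    // Resolution of two shards with the same owner, as seen from localNode.
    static Shard resolve(Shard a, Shard b, String localNode) {
        if (a.owner().equals(localNode)) {
            // locally owned (delta-flagged in the current code): sum the values
            return new Shard(a.owner(), Math.max(a.clock(), b.clock()), a.value() + b.value());
        }
        // foreign shards: keep the one with the highest clock
        return a.clock() >= b.clock() ? a : b;
    }

    public static void main(String[] args) {
        // Two increments on A that ended up in different sstables.
        Shard first = new Shard("A", 1, 1);
        Shard second = new Shard("A", 2, 2);

        // A reads its own shards and sums them: 1 + 2 = 3, the correct total.
        System.out.println("on A: " + resolve(first, second, "A").value());  // 3

        // After bootstrap the same shards are streamed to B. B does not own
        // them, so it keeps only the max-clock shard and reads 2.
        System.out.println("on B: " + resolve(first, second, "B").value());  // 2
    }
}
{code}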

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
