cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-4775) Counters 2.0
Date Tue, 26 Mar 2013 10:55:15 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613693#comment-13613693 ]

Sylvain Lebresne commented on CASSANDRA-4775:
---------------------------------------------

For the record, I'd like to "quickly" sum up the problems of counters 1.0, and more precisely
which ones I think are inherent to the design, and which ones I believe might be fixable
with some effort.

The current counter implementation is based on the idea of internally keeping one separate
sub-counter (or "shard") for each replica of the counter, and making sure that for each increment,
one shard and only one is ever incremented. The latter is ensured by the special write path
of counters, which:
* picks a live replica and forwards it the increment
* has that replica increment its own "shard" locally
* then has the replica send the *result* of this local shard increment to the other replicas
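The write path above can be sketched roughly as follows. This is an illustrative sketch only,
not Cassandra code; all names (Shard, Replica, etc.) are hypothetical:

```python
# Sketch of the shard-per-replica counter write path (hypothetical names,
# not Cassandra's actual classes).

class Shard:
    def __init__(self, clock=0, count=0):
        self.clock = clock   # per-replica version, increases with each local increment
        self.count = count   # total accumulated by the owning replica

class Replica:
    def __init__(self, name):
        self.name = name
        self.shards = {}     # replica name -> Shard (one shard per replica)

    def local_increment(self, delta):
        # Step 2: the chosen replica increments only its *own* shard.
        s = self.shards.setdefault(self.name, Shard())
        s.clock += 1
        s.count += delta
        return Shard(s.clock, s.count)   # the *result* of the increment, not the delta

    def apply_remote(self, owner, shard):
        # Step 3: other replicas keep the shard value with the highest clock.
        cur = self.shards.get(owner)
        if cur is None or shard.clock > cur.clock:
            self.shards[owner] = Shard(shard.clock, shard.count)

    def value(self):
        # The counter's value is the sum of all shards.
        return sum(s.count for s in self.shards.values())

# Step 1: the coordinator picks a live replica (say r1) and forwards the increment.
r1, r2 = Replica("r1"), Replica("r2")
result = r1.local_increment(5)
r2.apply_remote("r1", result)   # the result is replicated, not the increment
assert r1.value() == r2.value() == 5
```

Note that because the replicated message carries the shard's new total, re-delivering it to a
remote replica is harmless, but re-sending the original increment to the chosen replica is not;
this is one way to see why a timed-out increment cannot simply be retried.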

This mechanism has (at least) the following problems:
# counters cannot be retried safely on timeout.
# removing counters works only halfway: if you re-increment a deleted counter too soon, the
result is somewhat random.

Those problems are largely due to the general mechanism used, not to implementation details.
That being said, on the retry problem, I'll note that while I don't think we can fix it in
the current mechanism, tickets like CASSANDRA-3199 could mitigate it somewhat by making TimeoutException
less likely.

Other problems are due more to how the implementation works. More precisely, they are due
to how a replica proceeds to increment its own shard. To do that, the implementation uses
separate merge rules for "local" shards and "remote" ones. Namely, local shards are summed
during merge (so the sub-count they contain is used as a delta), while for remote ones the
"biggest" value is kept (where "biggest" means "the one with the biggest clock"). So for remote
shards, conflicts are handled as "latest wins", as usual. The reason for that difference between
local and remote shards is performance: when a replica needs to increment its shard, it
needs to do so "atomically". So if local shards were handled like remote ones, then to
increment the local shard we would need to 1) grab a lock, 2) read the current value, 3) increment
it, 4) write it, and then 5) release the lock. Instead, in the current implementation, the
replica just inserts an increment to its own shard, and to find the total value of the local
shard, the increments simply get merged on reads. In practice, what we win is that we
don't have to grab a lock.
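The asymmetry between the two merge rules can be sketched like this (again a hypothetical
illustration, with a shard reduced to a (clock, count) pair):

```python
# Sketch of the asymmetric merge rules described above (hypothetical,
# not Cassandra code). A shard is a (clock, count) pair.

def merge_local(a, b):
    # Local rule: counts are deltas, so they sum; clocks sum too so the
    # merged shard dominates both inputs.
    return (a[0] + b[0], a[1] + b[1])

def merge_remote(a, b):
    # Remote rule: "latest wins" -- keep the shard with the biggest clock.
    return a if a[0] >= b[0] else b

# A replica can record two independent increments to its own shard without
# any locking; the read path merges them into a total:
assert merge_local((1, 5), (1, 3)) == (2, 8)

# Remote shards carry totals, so merging keeps only the newest value:
assert merge_remote((2, 8), (1, 5)) == (2, 8)
```

The local rule is what lets the replica append an increment without a read-modify-write
cycle: reconciliation is deferred to reads.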

However, I believe that "implementation detail" is responsible for a fair amount of the pain
counters cause. In particular, it complicates the implementation substantially because:
* a local shard on one replica is a remote shard on another replica. We handle this by transforming
shards during deserialization, which is complex and fragile. It's also the source of CASSANDRA-4071
(and at least one contributor to CASSANDRA-4417).
* we have to be extremely careful not to duplicate a local shard internally or we'll over-count.
The storage engine having been initially designed with the idea that using the same column
twice was harmless, this has led to a number of bugs.
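The over-count hazard follows directly from the merge rules: merging a remote shard with a
copy of itself is idempotent, but merging a local shard with a copy of itself doubles the
count. A minimal sketch (hypothetical, with shards as (clock, count) pairs):

```python
# Duplicating a shard: harmless under the remote merge rule, an over-count
# under the local one (hypothetical sketch, not Cassandra code).

def merge_local(a, b):
    return (a[0] + b[0], a[1] + b[1])   # deltas: counts sum

def merge_remote(a, b):
    return a if a[0] >= b[0] else b     # totals: biggest clock wins

shard = (1, 5)
assert merge_remote(shard, shard) == (1, 5)    # idempotent: still 5
assert merge_local(shard, shard) == (2, 10)    # duplicated delta: over-count
```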

We could change that "implementation detail": we could stop distinguishing the merge rules
for local shards, and when a replica needs to increment its shard, it would read/increment/write
while holding a lock to ensure atomicity. This would likely simplify the implementation and
fix CASSANDRA-4071 and CASSANDRA-4417. Of course, this would still not fix the other top-level
problems (not being able to retry safely, broken remove, ...).
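The proposed alternative could be sketched as follows: the local shard uses the same
"total + clock, latest wins" representation as remote shards, and increments do a locked
read-modify-write instead of appending deltas (a hypothetical sketch, not a proposed patch):

```python
import threading

# Sketch of the lock-based alternative described above (hypothetical names).
# The shard always stores a total, never a delta, so local and remote shards
# can share a single "biggest clock wins" merge rule.

class LockedCounterShard:
    def __init__(self):
        self.lock = threading.Lock()
        self.clock = 0
        self.count = 0   # always a total, never a delta

    def increment(self, delta):
        with self.lock:          # 1) grab the lock
            self.count += delta  # 2) read the current value, 3) increment it
            self.clock += 1      # 4) write the new total under a newer clock
            return (self.clock, self.count)
                                 # 5) the lock is released on leaving the block

s = LockedCounterShard()
s.increment(5)
assert s.increment(3) == (2, 8)
```

With a single representation, merging a shard with a copy of itself is idempotent, which
is the property whose absence underlies the CASSANDRA-4071/CASSANDRA-4417 class of bugs;
the cost is taking a lock on the leader's write path.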

                
> Counters 2.0
> ------------
>
>                 Key: CASSANDRA-4775
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4775
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>              Labels: counters
>             Fix For: 2.0
>
>
> The existing partitioned counters remain a source of frustration for most users almost
two years after being introduced.  The remaining problems are inherent in the design, not
something that can be fixed given enough time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - fewer special cases in the code
> - potential for a retry mechanism

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
