cassandra-commits mailing list archives

From "Nicolas Favre-Felix (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4775) Counters 2.0
Date Fri, 05 Jul 2013 16:01:54 GMT


Nicolas Favre-Felix commented on CASSANDRA-4775:

A few comments on the design posted above in a GitHub gist:

* The "time" part of the client-provided TimeUUID is now compared to the server's timestamp
in the test "if(time(update-timeuuid) < now() - counter_write_window)". This is not ideal
in my opinion, but I guess Cassandra is now using "real" timestamps a lot more than it used
to. In any case, an "old" delta could also fall behind a "merge" cell and be ignored on read.
* Having "merge cells" means that we could support both TTLs and "put" operations on counters,
as long as the semantics are well defined.
* Could counter merges happen in the background at ALL since most reads will receive responses
from all replicas anyway, or would we always require QUORUM writes? This could be too restrictive
in multi-DC deployments where most people probably prefer to read and write at LOCAL_QUORUM.
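To make the "counter_write_window" test in the first point concrete, here is a rough Python sketch. The constant COUNTER_WRITE_WINDOW and the helper names are mine, purely for illustration; the design only gives the pseudocode condition.

```python
import time
import uuid

# Hypothetical server-side setting, in seconds (name assumed, not from the design).
COUNTER_WRITE_WINDOW = 10.0

# Version-1 (time) UUIDs embed a timestamp counted in 100-ns intervals
# since 1582-10-15; this offset converts to the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_to_unix_seconds(u: uuid.UUID) -> float:
    """Extract the embedded timestamp of a TimeUUID as Unix seconds."""
    return (u.time - UUID_EPOCH_OFFSET) / 1e7

def delta_is_too_old(update_timeuuid: uuid.UUID, now: float) -> bool:
    """Mirror of the pseudocode: time(update-timeuuid) < now() - counter_write_window."""
    return timeuuid_to_unix_seconds(update_timeuuid) < now - COUNTER_WRITE_WINDOW
```

The risk described above is visible here: the client's clock produces the TimeUUID but the server's clock produces now(), so a delayed or clock-skewed delta can fail this test and be silently dropped behind a merge cell.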

As described above, finding a "merge point" at which we could roll up deltas involves either
QUORUM reads + QUORUM writes or a read at ALL. This is necessary since we need a majority
of replicas to persist the merge cell. We could consider this "set of deltas" that make up
a counter to be merged at different levels, though. When this set is common to all replicas
(as is proposed above), we can only merge during QUORUM reads if we can _guarantee_ QUORUM
writes, and must otherwise merge during reads at ALL.
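The quorum-overlap argument behind this can be stated as simple arithmetic: a merge cell persisted on W replicas is guaranteed visible to a read of R replicas only when W + R > RF. A small sketch (function names are mine):

```python
def quorum(rf: int) -> int:
    """Majority of rf replicas."""
    return rf // 2 + 1

def merge_visible_to_reads(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """A merge cell written to write_replicas nodes is guaranteed to be seen
    only if every possible read set intersects the write set."""
    return write_replicas + read_replicas > rf
```

With RF=3, QUORUM writes (2) plus QUORUM reads (2) always overlap, so merging in QUORUM reads is safe; if writes may land on a single replica, only a read at ALL is guaranteed to see the merge cell.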

If, instead, we shard this "set of deltas" among replicas with a distribution scheme resembling
the existing implementation, each replica becomes the "coordinator" for a fraction of deltas
and can (on its own) merge the increments for which it is responsible and issue "merge cells"
to replace them. It becomes possible to read and write at LOCAL_QUORUM using this scheme.
As any particular replica is the only source of truth for the subset of deltas that it was
assigned, it does _by definition_ read ALL of its deltas and can sum them up with no risk
of inconsistency. When these cells are node-local with a single source of truth, they can
be merged by their owner and a merge cell replicated easily.
The main issue with this implementation is the choice of the coordinator node for an increment
operation: if we assign a replica at random, retrying would lead to duplicates; if we assign
a replica deterministically (based on the operation's UUID for example) we risk not being
able to write to the counter if that particular replica goes down.
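The deterministic-assignment trade-off can be sketched in a few lines (replica names and the coordinator_for helper are hypothetical, for illustration only):

```python
import uuid

# Hypothetical replica set for one counter's token range.
REPLICAS = ["replica-a", "replica-b", "replica-c"]

def coordinator_for(op_id: uuid.UUID) -> str:
    """Deterministic assignment: the same operation UUID always maps to the
    same replica, so retrying an increment cannot create a duplicate delta."""
    return REPLICAS[op_id.int % len(REPLICAS)]
```

The same function exposes the availability problem: if the replica that coordinator_for returns is down, the increment cannot be accepted anywhere, whereas random assignment stays available but makes retries unsafe.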

I'd like to propose a solution that lies between merging counters across the whole cluster
and merging counters in each individual replica:
We can shard counters based on the datacenter, and roll up these UUIDs per DC. In that case,
the scope of the set of replicas involved in merging deltas together is therefore limited to
the replicas of the local DC, which (once again) can merge deltas by either getting involved
at W.QUORUM+R.QUORUM or W.anything+R.ALL.
A configuration flag per counter CF would configure whether we require W.QUORUM+R.QUORUM (default)
or let clients write with any CL with the downside that deltas can only be merged at CL.ALL.
The same issue with retries applies here, albeit at a different level: a particular operation
can only be retried safely if it is sent to the same datacenter, which seems reasonable.
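To illustrate the per-DC sharding idea, here is a minimal in-memory sketch (the class and method names are mine, not from the design): each DC accumulates its own deltas and rolls them up into its own merge cell independently, and a read sums the per-DC subtotals plus any unmerged deltas.

```python
from collections import defaultdict

class DCShardedCounter:
    """Toy model of a counter whose delta set is sharded by datacenter."""

    def __init__(self):
        self.merged = defaultdict(int)   # dc -> rolled-up subtotal ("merge cell")
        self.deltas = defaultdict(list)  # dc -> unmerged increments

    def increment(self, dc: str, delta: int) -> None:
        self.deltas[dc].append(delta)

    def merge_dc(self, dc: str) -> None:
        """Roll up one DC's deltas into its merge cell. Only replicas in
        that DC are involved, which is what makes LOCAL_QUORUM viable."""
        self.merged[dc] += sum(self.deltas.pop(dc, []))

    def value(self) -> int:
        """Total = sum of per-DC merge cells + any not-yet-merged deltas."""
        return sum(self.merged.values()) + sum(
            sum(ds) for ds in self.deltas.values())
```

Note that merging one DC never changes the observed total, which is the invariant that lets each DC roll up its shard without cluster-wide coordination.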

I believe that the space and time overheads are about the same as in Aleksey's design.

Suggestions and ideas are much welcome.

> Counters 2.0
> ------------
>                 Key: CASSANDRA-4775
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Arya Goudarzi
>            Assignee: Aleksey Yeschenko
>              Labels: counters
>             Fix For: 2.1
> The existing partitioned counters remain a source of frustration for most users almost
two years after being introduced.  The remaining problems are inherent in the design, not
something that can be fixed given enough time/eyeballs.
> Ideally a solution would give us
> - similar performance
> - fewer special cases in the code
> - potential for a retry mechanism

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.